# Data Cleaning Notebook

## EmojiInsight Project

This notebook is dedicated to the data cleaning and preprocessing steps for the **EmojiInsight** project. We will parse and process the raw data files, particularly focusing on the `emoji-test.txt` file from Unicode, to prepare the data for analysis and modeling.

---

### 1. Importing Necessary Libraries

First, we import the libraries needed for data processing. `sys` and `os` are used in the next step to adjust the Python path so that custom modules can be imported from the `src` directory.

In [55]:
import sys
import os
import pandas as pd

### 2. Adjusting `sys.path`

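Since our notebook is located in the `notebooks/` directory and our custom modules are in the `src/` directory, we need to add the project root (the parent of `notebooks/`) to `sys.path`. This allows us to import modules as `src.<module>`. The relative path `'..'` is resolved against the current working directory, so this assumes the notebook is run from `notebooks/`.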

In [56]:
# Get the absolute path of the parent directory (EmojiInsight)
parent_dir = os.path.abspath('..')

# Add the parent directory to sys.path if it's not already there
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)
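
The same step can be wrapped in a small helper so the cell stays idempotent when it is re-run. This is a minimal sketch; `add_to_sys_path` is a hypothetical name and not part of the project's `src` package.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Put `directory` at the front of sys.path (once) and return its absolute path."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# When run from notebooks/, '..' resolves to the project root.
project_root = add_to_sys_path('..')
```

Resolving the path first means the membership check compares absolute paths, so the directory is added at most once however many times the cell runs.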

### 3. Importing the `data_processing` Module

Now we can import our custom data processing module, which contains functions for loading and parsing data.

In [57]:
import src.data_processing as dp

### 4. Parsing the `emoji-test.txt` File

We will use the `parse_emoji_test` function from our `data_processing` module to parse the `emoji-test.txt` file. This file contains comprehensive information about all standardized emojis.

In [58]:
# Define the file path to emoji-test.txt
file_path = os.path.join(parent_dir, 'data', 'raw', 'emoji_data', 'emoji-test.txt')

# Parse the emoji-test.txt file
emoji_df = dp.parse_emoji_test(file_path)
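
The implementation of `dp.parse_emoji_test` is not shown in this notebook. As a rough illustration of what such a parser has to do, the sketch below reads lines in the `emoji-test.txt` layout and tracks the current group and subgroup. The function name `parse_emoji_test_lines` and the separate `version` field are assumptions for illustration. It also assumes every data line carries an `E<version>` tag, as the `name` values in the output later in this notebook suggest.

```python
def parse_emoji_test_lines(lines):
    """Parse emoji-test.txt style lines into a list of row dictionaries."""
    rows = []
    group = subgroup = None
    for raw in lines:
        line = raw.strip()
        if line.startswith('# group:'):
            group = line.split(':', 1)[1].strip()
        elif line.startswith('# subgroup:'):
            subgroup = line.split(':', 1)[1].strip()
        elif line and not line.startswith('#'):
            # Data line: "<codepoints> ; <status> # <emoji> E<version> <name>"
            codepoints, rest = line.split(';', 1)
            status, comment = rest.split('#', 1)
            emoji, version, name = comment.strip().split(' ', 2)
            rows.append({
                'group': group, 'subgroup': subgroup,
                'codepoints': codepoints.strip(), 'status': status.strip(),
                'emoji': emoji, 'version': version, 'name': name,
            })
    return rows


sample = [
    '# group: Smileys & Emotion',
    '# subgroup: face-smiling',
    '1F600 ; fully-qualified # 😀 E1.0 grinning face',
    '1F603 ; fully-qualified # 😃 E0.6 grinning face with big eyes',
]
rows = parse_emoji_test_lines(sample)
```

Passing `rows` to `pd.DataFrame` would give a frame with the same columns as `emoji_df` plus `version`. Splitting the version tag out keeps the `name` column free of the `E1.0` style prefix.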

### 5. Saving the Parsed Data

After parsing, we'll save the resulting DataFrame to the `data/processed/` directory for future use in our analysis and modeling.

In [59]:
# Define the output path for the processed data
output_path = os.path.join(parent_dir, 'data', 'processed', 'emoji_list.csv')

# Ensure the processed data directory exists
processed_data_dir = os.path.dirname(output_path)
os.makedirs(processed_data_dir, exist_ok=True)

# Save the DataFrame to the processed data directory
emoji_df.to_csv(output_path, index=False)


### 6. Viewing the Parsed Data

Let's take a look at the first few rows of the parsed data to verify that everything was processed correctly.

In [60]:
emoji_df.head()

| | group | subgroup | codepoints | status | emoji | name |
|---|---|---|---|---|---|---|
| 0 | Smileys & Emotion | face-smiling | 1F600 | fully-qualified | 😀 | E1.0 grinning face |
| 1 | Smileys & Emotion | face-smiling | 1F603 | fully-qualified | 😃 | E0.6 grinning face with big eyes |
| 2 | Smileys & Emotion | face-smiling | 1F604 | fully-qualified | 😄 | E0.6 grinning face with smiling eyes |
| 3 | Smileys & Emotion | face-smiling | 1F601 | fully-qualified | 😁 | E0.6 beaming face with smiling eyes |
| 4 | Smileys & Emotion | face-smiling | 1F606 | fully-qualified | 😆 | E0.6 grinning squinting face |


### 7. Summary of Parsed Data

We can also get a summary of the DataFrame to understand the structure and content of the parsed data.

In [61]:
emoji_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5034 entries, 0 to 5033
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   group       5034 non-null   object
 1   subgroup    5034 non-null   object
 2   codepoints  5034 non-null   object
 3   status      5034 non-null   object
 4   emoji       5034 non-null   object
 5   name        5034 non-null   object
dtypes: object(6)
memory usage: 236.1+ KB


### 8. Next Steps

Now that we've successfully parsed and saved the Unicode emoji data, the next steps include:

- **Cleaning and preprocessing** the `emoji_sentiment_data_v1.0.csv` file.
- **Merging** the parsed emoji data with the sentiment data.
- **Exploring** the combined data to gain insights and prepare it for modeling.

---

## 10. Cleaning and Preprocessing the `emoji_sentiment_data_v1.0.csv` File

In this section, we'll load the Emoji Sentiment Data, perform necessary cleaning, and prepare it for merging with the parsed Unicode emoji data.

---

### 10.1. Loading the Dataset

We will load the `emoji_sentiment_data_v1.0.csv` file into a pandas DataFrame for analysis.

In [62]:
# Define the file path to the dataset
sentiment_file_path = os.path.join(parent_dir, 'data', 'raw', 'emoji_data', 'Emoji_Sentiment_Data_v1.0.csv')

# Check if the file exists to avoid errors
if not os.path.exists(sentiment_file_path):
    raise FileNotFoundError(f"The file {sentiment_file_path} does not exist. Please check the file path.")

# Load the dataset
sentiment_df = pd.read_csv(sentiment_file_path)

# Display the first few rows
sentiment_df.head()


| | Emoji | Unicode codepoint | Occurrences | Position | Negative | Neutral | Positive | Unicode name | Unicode block |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 😂 | 0x1f602 | 14622 | 0.805101 | 3614 | 4163 | 6845 | FACE WITH TEARS OF JOY | Emoticons |
| 1 | ❤ | 0x2764 | 8050 | 0.746943 | 355 | 1334 | 6361 | HEAVY BLACK HEART | Dingbats |
| 2 | ♥ | 0x2665 | 7144 | 0.753806 | 252 | 1942 | 4950 | BLACK HEART SUIT | Miscellaneous Symbols |
| 3 | 😍 | 0x1f60d | 6359 | 0.765292 | 329 | 1390 | 4640 | SMILING FACE WITH HEART-SHAPED EYES | Emoticons |
| 4 | 😭 | 0x1f62d | 5526 | 0.803352 | 2412 | 1218 | 1896 | LOUDLY CRYING FACE | Emoticons |


### 10.2. Exploring the Dataset

Let's explore the dataset to understand its structure, data types, and identify any potential issues.

In [63]:
# Get basic information about the DataFrame
sentiment_df.info()
# Check for missing values
sentiment_df.isnull().sum()
# Get summary statistics
sentiment_df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 969 entries, 0 to 968
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Emoji              969 non-null    object 
 1   Unicode codepoint  969 non-null    object 
 2   Occurrences        969 non-null    int64  
 3   Position           969 non-null    float64
 4   Negative           969 non-null    int64  
 5   Neutral            969 non-null    int64  
 6   Positive           969 non-null    int64  
 7   Unicode name       969 non-null    object 
 8   Unicode block      969 non-null    object 
dtypes: float64(1), int64(4), object(4)
memory usage: 68.3+ KB


| | Emoji | Unicode codepoint | Occurrences | Position | Negative | Neutral | Positive | Unicode name | Unicode block |
|---|---|---|---|---|---|---|---|---|---|
| count | 969 | 969 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969 | 969 |
| unique | 969 | 969 | | | | | | 969 | 26 |
| top | 📔 | 0x1f4d4 | | | | | | NOTEBOOK WITH DECORATIVE COVER | Miscellaneous Symbols and Pictographs |
| freq | 1 | 1 | | | | | | 1 | 484 |
| mean | | | 161.961816 | 0.65498 | 24.873065 | 48.942208 | 88.146543 | | |
| std | | | 713.675266 | 0.195021 | 153.120845 | 189.329228 | 424.849727 | | |
| min | | | 1.0 | 0.009615 | 0.0 | 0.0 | 0.0 | | |
| 25% | | | 5.0 | 0.557252 | 0.0 | 2.0 | 2.0 | | |
| 50% | | | 18.0 | 0.68 | 2.0 | 7.0 | 7.0 | | |
| 75% | | | 77.0 | 0.793676 | 8.0 | 28.0 | 36.0 | | |
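
One check the summary statistics do not cover is whether the three sentiment counts add up to `Occurrences`. The sketch below tests that on the first three rows shown earlier, where the sums do match. Treating this as an invariant of the whole dataset is an assumption, and the names `label_total` and `inconsistent` are illustrative.

```python
import pandas as pd

# First rows of the sentiment data, copied from the head() output
sample = pd.DataFrame({
    'Emoji': ['😂', '❤', '♥'],
    'Occurrences': [14622, 8050, 7144],
    'Negative': [3614, 355, 252],
    'Neutral': [4163, 1334, 1942],
    'Positive': [6845, 6361, 4950],
})

# Rows where the three sentiment counts do not add up to the total
label_total = sample[['Negative', 'Neutral', 'Positive']].sum(axis=1)
inconsistent = sample[label_total != sample['Occurrences']]
print(f"Inconsistent rows: {len(inconsistent)}")
```

Running the same comparison on the full `sentiment_df` would flag any row that needs a closer look before merging.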


### 10.3. Data Cleaning

Based on the exploration, we'll perform the following cleaning steps:

- **Rename Columns:** Standardize column names for consistency.
- **Handle Missing Values:** Address any missing data if present.
- **Correct Data Types:** Ensure columns have appropriate data types.
- **Standardize Unicode Codepoints:** Prepare codepoints for merging with the Unicode emoji data.

In [64]:
# Rename columns for consistency
sentiment_df.rename(columns={
    'Emoji': 'emoji',
    'Unicode codepoint': 'codepoint',
    'Occurrences': 'occurrences',
    'Position': 'position',
    'Negative': 'negative',
    'Neutral': 'neutral',
    'Positive': 'positive',
    'Unicode name': 'name',
    'Unicode block': 'block'
}, inplace=True)

# Verify the changes
sentiment_df.head()

| | emoji | codepoint | occurrences | position | negative | neutral | positive | name | block |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 😂 | 0x1f602 | 14622 | 0.805101 | 3614 | 4163 | 6845 | FACE WITH TEARS OF JOY | Emoticons |
| 1 | ❤ | 0x2764 | 8050 | 0.746943 | 355 | 1334 | 6361 | HEAVY BLACK HEART | Dingbats |
| 2 | ♥ | 0x2665 | 7144 | 0.753806 | 252 | 1942 | 4950 | BLACK HEART SUIT | Miscellaneous Symbols |
| 3 | 😍 | 0x1f60d | 6359 | 0.765292 | 329 | 1390 | 4640 | SMILING FACE WITH HEART-SHAPED EYES | Emoticons |
| 4 | 😭 | 0x1f62d | 5526 | 0.803352 | 2412 | 1218 | 1896 | LOUDLY CRYING FACE | Emoticons |


In [65]:
# Check for missing values again; none are present, so no rows need to be dropped
sentiment_df.isnull().sum()

emoji          0
codepoint      0
occurrences    0
position       0
negative       0
neutral        0
positive       0
name           0
block          0
dtype: int64

In [66]:
# Ensure numeric columns are of numeric data types
numeric_columns = ['occurrences', 'position', 'negative', 'neutral', 'positive']
sentiment_df[numeric_columns] = sentiment_df[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Verify data types
sentiment_df.dtypes

emoji           object
codepoint       object
occurrences      int64
position       float64
negative         int64
neutral          int64
positive         int64
name            object
block           object
dtype: object

In [67]:
# Remove '0x' prefix and convert codepoints to uppercase
sentiment_df['codepoint_std'] = sentiment_df['codepoint'].str.replace('0x', '').str.upper()
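
The string-method chain above removes `0x` wherever it appears in the value. A slightly stricter variant, sketched below, strips the prefix only at the start of each code point and also handles the space-separated sequences used in `emoji-test.txt`. `standardize_codepoint` is a hypothetical helper, not part of the project code.

```python
def standardize_codepoint(value):
    """Normalize '0x1f602' or '1F469 200D 1F4BB' to upper-case, hyphen-joined hex."""
    parts = value.replace('-', ' ').split()
    cleaned = []
    for part in parts:
        part = part.lower()
        if part.startswith('0x'):
            part = part[2:]  # Drop the prefix only at the start of the code point
        cleaned.append(part.upper())
    return '-'.join(cleaned)


examples = [standardize_codepoint(v) for v in ('0x1f602', '0x2764', '1F469 200D 1F4BB')]
```

It could be applied to a column with `sentiment_df['codepoint'].map(standardize_codepoint)`, which would give both DataFrames the same key format from a single function.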

### 10.4. Saving the Cleaned Data

After cleaning, we'll save the cleaned DataFrame to the `data/processed/` directory for future use.

In [68]:
# Define the output path for the cleaned data
cleaned_sentiment_path = os.path.join(parent_dir, 'data', 'processed', 'emoji_sentiment_data_cleaned.csv')

# Save the cleaned DataFrame
sentiment_df.to_csv(cleaned_sentiment_path, index=False)

### 10.5.1. Verify Columns in `sentiment_df` and `emoji_df`

Before merging, let's ensure that both DataFrames contain the `'codepoint_std'` column.

In [69]:
# Check columns in sentiment_df
print("Columns in sentiment_df:", sentiment_df.columns.tolist())

# Check columns in emoji_df
print("Columns in emoji_df:", emoji_df.columns.tolist())

Columns in sentiment_df: ['emoji', 'codepoint', 'occurrences', 'position', 'negative', 'neutral', 'positive', 'name', 'block', 'codepoint_std']
Columns in emoji_df: ['group', 'subgroup', 'codepoints', 'status', 'emoji', 'name']


In [70]:
# Verify 'codepoint_std' in sentiment_df
if 'codepoint_std' in sentiment_df.columns:
    print("The 'codepoint_std' column exists in sentiment_df.")
else:
    print("The 'codepoint_std' column is missing in sentiment_df.")

# Verify 'codepoint_std' in emoji_df
if 'codepoint_std' in emoji_df.columns:
    print("The 'codepoint_std' column exists in emoji_df.")
else:
    print("The 'codepoint_std' column is missing in emoji_df.")

The 'codepoint_std' column exists in sentiment_df.
The 'codepoint_std' column is missing in emoji_df.


### 10.5.2. Creating `'codepoint_std'` in `emoji_df`

It appears that the `'codepoint_std'` column is missing in `emoji_df`. We'll create it by standardizing the `'codepoints'` column.

In [71]:
# Check if 'codepoints' exists in emoji_df
if 'codepoints' in emoji_df.columns:
    # Create 'codepoint_std'
    emoji_df['codepoint_std'] = emoji_df['codepoints'].str.replace(' ', '-').str.upper()
    print("'codepoint_std' column created in emoji_df.")
else:
    print("The 'codepoints' column is missing in emoji_df. Please verify the data.")
    
# Verify creation
if 'codepoint_std' in emoji_df.columns:
    print("Verification: 'codepoint_std' exists in emoji_df.")
else:
    raise KeyError("Failed to create 'codepoint_std' in emoji_df.")


'codepoint_std' column created in emoji_df.
Verification: 'codepoint_std' exists in emoji_df.


### 10.5.3. Merging `sentiment_df` with `emoji_df`

With the `'codepoint_std'` column present in both DataFrames, we'll perform the merge to enrich our dataset.

In [72]:
# Perform the merge on 'codepoint_std'
merged_df = pd.merge(
    sentiment_df,
    emoji_df,
    on='codepoint_std',
    how='inner',
    suffixes=('_sentiment', '_emoji')  # To handle any overlapping column names
)

# Display the columns after merge
print("Columns in merged_df after merge:", merged_df.columns.tolist())

# Display the first few rows to verify
merged_df.head()

Columns in merged_df after merge: ['emoji_sentiment', 'codepoint', 'occurrences', 'position', 'negative', 'neutral', 'positive', 'name_sentiment', 'block', 'codepoint_std', 'group', 'subgroup', 'codepoints', 'status', 'emoji_emoji', 'name_emoji']


| | emoji_sentiment | codepoint | occurrences | position | negative | neutral | positive | name_sentiment | block | codepoint_std | group | subgroup | codepoints | status | emoji_emoji | name_emoji |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 😂 | 0x1f602 | 14622 | 0.805101 | 3614 | 4163 | 6845 | FACE WITH TEARS OF JOY | Emoticons | 1F602 | Smileys & Emotion | face-smiling | 1F602 | fully-qualified | 😂 | E0.6 face with tears of joy |
| 1 | ❤ | 0x2764 | 8050 | 0.746943 | 355 | 1334 | 6361 | HEAVY BLACK HEART | Dingbats | 2764 | Smileys & Emotion | heart | 2764 | unqualified | ❤ | E0.6 red heart |
| 2 | ♥ | 0x2665 | 7144 | 0.753806 | 252 | 1942 | 4950 | BLACK HEART SUIT | Miscellaneous Symbols | 2665 | Activities | game | 2665 | unqualified | ♥ | E0.6 heart suit |
| 3 | 😍 | 0x1f60d | 6359 | 0.765292 | 329 | 1390 | 4640 | SMILING FACE WITH HEART-SHAPED EYES | Emoticons | 1F60D | Smileys & Emotion | face-affection | 1F60D | fully-qualified | 😍 | E0.6 smiling face with heart-eyes |
| 4 | 😭 | 0x1f62d | 5526 | 0.803352 | 2412 | 1218 | 1896 | LOUDLY CRYING FACE | Emoticons | 1F62D | Smileys & Emotion | face-concerned | 1F62D | fully-qualified | 😭 | E0.6 loudly crying face |
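
An inner merge silently drops sentiment rows that have no match in the Unicode data (the `info()` outputs in this notebook show 969 rows before the merge and 734 after). To see which rows are lost, one option is a left merge with `indicator=True`. The sketch below uses toy frames standing in for `sentiment_df` and `emoji_df`; the code point `FFFF` is made up to force a miss.

```python
import pandas as pd

# Toy frames standing in for sentiment_df and emoji_df
left = pd.DataFrame({
    'codepoint_std': ['1F602', '2764', 'FFFF'],
    'occurrences': [14622, 8050, 1],
})
right = pd.DataFrame({
    'codepoint_std': ['1F602', '2764', '1F600'],
    'group': ['Smileys & Emotion'] * 3,
})

# A left merge keeps every sentiment row; '_merge' records whether it found a match
audit = left.merge(right, on='codepoint_std', how='left', indicator=True)
unmatched = audit.loc[audit['_merge'] == 'left_only', 'codepoint_std'].tolist()
print(unmatched)
```

Inspecting the unmatched code points before switching to `how='inner'` makes it easier to tell whether the loss comes from formatting differences or from characters that are absent from `emoji-test.txt`.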


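### 10.5.4. Reloading the Parsed Unicode Emoji Data

If `emoji_df` is no longer in memory (for example, after a kernel restart), it can be reloaded from the processed CSV. The standardized codepoint column has to be rebuilt because it was added after `emoji_list.csv` was saved. The merge in 10.5.3 then needs to be re-run if `merged_df` is also missing.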

In [73]:
# Ensure the parsed emoji data is loaded
# If not already loaded, load it from the processed data
emoji_list_path = os.path.join(parent_dir, 'data', 'processed', 'emoji_list.csv')
emoji_df = pd.read_csv(emoji_list_path)

# Standardize codepoints in the emoji data
emoji_df['codepoint_std'] = emoji_df['codepoints'].str.replace(' ', '-').str.upper()


Next, we consolidate `'emoji_sentiment'` and `'emoji_emoji'` into a single `'emoji'` column and remove the redundant column.

In [74]:
# Consolidate 'emoji_sentiment' and 'emoji_emoji' into a single 'emoji' column
if 'emoji_sentiment' in merged_df.columns and 'emoji_emoji' in merged_df.columns:
    # Prefer 'emoji_sentiment' as the primary emoji column
    merged_df['emoji'] = merged_df['emoji_sentiment']
    # Drop the redundant 'emoji_emoji' column
    merged_df.drop(['emoji_sentiment', 'emoji_emoji'], axis=1, inplace=True)
    print("'emoji_sentiment' column has been renamed to 'emoji' and 'emoji_emoji' has been dropped.")
elif 'emoji_sentiment' in merged_df.columns:
    merged_df.rename(columns={'emoji_sentiment': 'emoji'}, inplace=True)
    print("'emoji_sentiment' column has been renamed to 'emoji'.")
elif 'emoji_emoji' in merged_df.columns:
    merged_df.rename(columns={'emoji_emoji': 'emoji'}, inplace=True)
    print("'emoji_emoji' column has been renamed to 'emoji'.")
else:
    print("No duplicate 'emoji' columns found.")

# Verify the 'emoji' column
if 'emoji' in merged_df.columns:
    print("Verification: 'emoji' column is present in merged_df.")
    print(f"'emoji' column dtype: {merged_df['emoji'].dtype}")
    print(merged_df['emoji'].head())
else:
    print("'emoji' column is missing in merged_df after consolidation.")


'emoji_sentiment' column has been renamed to 'emoji' and 'emoji_emoji' has been dropped.
Verification: 'emoji' column is present in merged_df.
'emoji' column dtype: object
0    😂
1    ❤
2    ♥
3    😍
4    😭
Name: emoji, dtype: object


### 10.6. Saving the Merged Data

We will save the merged DataFrame for future analysis and modeling.

In [75]:
# Define the output path for the merged data
merged_data_path = os.path.join(parent_dir, 'data', 'processed', 'emoji_data_merged.csv')

# Save the merged DataFrame
merged_df.to_csv(merged_data_path, index=False)

### 10.7. Verifying the Merged Data

Let's check the structure of the merged data to ensure that the merge was successful.

In [76]:
# Get info about the merged DataFrame
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 734 entries, 0 to 733
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   codepoint       734 non-null    object 
 1   occurrences     734 non-null    int64  
 2   position        734 non-null    float64
 3   negative        734 non-null    int64  
 4   neutral         734 non-null    int64  
 5   positive        734 non-null    int64  
 6   name_sentiment  734 non-null    object 
 7   block           734 non-null    object 
 8   codepoint_std   734 non-null    object 
 9   group           734 non-null    object 
 10  subgroup        734 non-null    object 
 11  codepoints      734 non-null    object 
 12  status          734 non-null    object 
 13  name_emoji      734 non-null    object 
 14  emoji           734 non-null    object 
dtypes: float64(1), int64(4), object(10)
memory usage: 86.1+ KB


In [77]:
# Check for any missing values
merged_df.isnull().sum()

codepoint         0
occurrences       0
position          0
negative          0
neutral           0
positive          0
name_sentiment    0
block             0
codepoint_std     0
group             0
subgroup          0
codepoints        0
status            0
name_emoji        0
emoji             0
dtype: int64

## 11. Emoji Sentiment Distribution by Category

We'll use the emoji-data repository, which provides comprehensive metadata for emojis, including their categories. We'll download the emoji.json file, parse it, and create a mapping dictionary to associate each emoji with its category.

In [78]:
# Install necessary libraries if not already installed
%pip install pandas plotly requests

import pandas as pd
import plotly.express as px
import requests
import json
import os

Note: you may need to restart the kernel to use updated packages.


## Fetching Emoji Data
We'll fetch comprehensive emoji metadata from the emoji-data repository. This data includes information about each emoji, such as its category and unified code points.

In [79]:
# URL of the emoji-data JSON file
emoji_data_url = "https://raw.githubusercontent.com/iamcal/emoji-data/master/emoji.json"

# A timeout keeps the cell from hanging if the server does not respond
response = requests.get(emoji_data_url, timeout=30)

if response.status_code == 200:
    emoji_data = response.json()
    print("Emoji-data fetched successfully.")
else:
    # Stop here: later cells depend on emoji_data being defined
    raise RuntimeError(f"Failed to fetch emoji-data. Status code: {response.status_code}")

Emoji-data fetched successfully.
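
Because the download depends on network access, it can help to cache the file locally so the notebook can be re-run offline. The sketch below is one way to do that. It swaps `requests` for the standard library's `urllib.request` to stay dependency-free, and both `load_json_cached` and the cache location are hypothetical.

```python
import json
import urllib.request
from pathlib import Path


def load_json_cached(url, cache_path, timeout=30):
    """Return parsed JSON from cache_path if present, otherwise download and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding='utf-8'))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode('utf-8')
    data = json.loads(text)  # Parse first so a bad payload is never cached
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding='utf-8')
    return data
```

In this notebook it might be called as `load_json_cached(emoji_data_url, os.path.join(parent_dir, 'data', 'raw', 'emoji_data', 'emoji.json'))`, where the file name `emoji.json` is an assumed cache location.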


## Creating Emoji to Category Mapping

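Using the fetched emoji_data, we try to create a mapping from each emoji character to its category by reading an `emoji` key from each entry. As the empty sample output below shows, no entry yields a value for that key, so `emoji_category_mapping` stays empty. A working mapping is built from the `unified` field later in this notebook.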

In [80]:
# Create a mapping dictionary: emoji -> category
emoji_category_mapping = {}

for emoji_entry in emoji_data:
    emoji_char = emoji_entry.get('emoji')
    category = emoji_entry.get('category')
    
    if emoji_char and category:
        emoji_category_mapping[emoji_char] = category.capitalize()

sample_mapping = dict(list(emoji_category_mapping.items())[:10])
print("Sample Emoji to Category Mapping:")
for emoji_char, cat in sample_mapping.items():
    print(f"{emoji_char}: {cat}")


Sample Emoji to Category Mapping:


## Parsing emoji-test.txt
The emoji-test.txt file contains a comprehensive list of emojis along with their unified code points. We'll parse this file to create a mapping from unified code points to emoji characters.

### Verify the File Path
Ensure that the emoji-test.txt file exists at the specified path.

In [81]:
# Define the correct absolute path to the emoji-test.txt file
emoji_test_path = '/workspace/EmojiInsight/data/raw/emoji_data/emoji-test.txt'

if os.path.isfile(emoji_test_path):
    print(f"File found at: {emoji_test_path}")
else:
    print(f"File not found at: {emoji_test_path}")


File found at: /workspace/EmojiInsight/data/raw/emoji_data/emoji-test.txt


In [82]:
# Directory containing the emoji-test.txt file
emoji_dir = '/workspace/EmojiInsight/data/raw/emoji_data/'

try:
    contents = os.listdir(emoji_dir)
    print(f"Contents of '{emoji_dir}':")
    for item in contents:
        print(item)
except FileNotFoundError:
    print(f"The directory '{emoji_dir}' does not exist.")


Contents of '/workspace/EmojiInsight/data/raw/emoji_data/':
ESR_v1.0_format.txt
Emoji_Sentiment_Data_v1.0.csv
emoji-test.txt


## Define Conversion Function
This function converts unified code point strings (e.g., "1F469-200D-1F4BB") into actual emoji characters (e.g., 👩‍💻).

In [83]:
def unified_to_emoji_func(unified_str):
    """
    Convert a unified code point string to the actual emoji character.
    Handles multiple code points separated by '-'.
    """
    try:
        code_points = unified_str.split('-')
        emoji_char = ''.join([chr(int(cp, 16)) for cp in code_points])
        return emoji_char
    except Exception as e:
        print(f"Error converting unified '{unified_str}': {e}")
        return None
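
The reverse conversion is useful for debugging, because it shows exactly which code points a visually similar emoji string contains. A minimal sketch follows; `emoji_to_unified` and `unified_to_emoji_str` are illustrative names, and the second function repeats the notebook's conversion logic without the error handling.

```python
def emoji_to_unified(emoji_char):
    """Return the hyphen-joined, upper-case hex code points of an emoji string."""
    return '-'.join(f'{ord(ch):04X}' for ch in emoji_char)


def unified_to_emoji_str(unified_str):
    """Inverse of emoji_to_unified: build the string from hyphen-joined hex code points."""
    return ''.join(chr(int(cp, 16)) for cp in unified_str.split('-'))


heart_plain = emoji_to_unified('\u2764')            # bare heart character
heart_emoji = emoji_to_unified('\u2764\uFE0F')      # heart followed by a variation selector
round_trip = emoji_to_unified(unified_to_emoji_str('1F469-200D-1F4BB'))
```

Two strings that render almost identically can differ by a trailing variation selector, and comparing their unified forms makes that difference visible.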


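## Parse emoji-test.txt to Create the unified_to_emoji Dictionary

We parse the emoji-test.txt file to create a mapping from code point sequences to emoji characters. The keys keep the file's own format, with code points separated by spaces (for example, "1F469 200D 1F4BB").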

In [84]:
def parse_emoji_test(file_path):
    """
    Parse the emoji_test.txt file to create a mapping from 'unified' code points to emoji characters.
    """
    unified_to_emoji = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            if not line.strip() or line.startswith('#'):
                continue
            parts = line.split(';')
            if len(parts) < 2:
                continue
            unified = parts[0].strip()
            if '#' in parts[1]:
                # Split on the first '#' only, so the keycap '#' emoji is not lost
                emoji_part = parts[1].split('#', 1)[1].strip()
                emoji_char = emoji_part.split(' ')[0]
                unified_to_emoji[unified] = emoji_char
    return unified_to_emoji

unified_to_emoji = parse_emoji_test(emoji_test_path)
print(f"Total emojis parsed from emoji_test.txt: {len(unified_to_emoji)}")



Total emojis parsed from emoji_test.txt: 5034


## Creating the Unified to Category Mapping

Using the `unified` field of each emoji_data entry and the conversion function defined above, we'll build a mapping from each emoji character to its category.

In [85]:
# Initialize the mapping dictionary
unified_to_category = {}

for emoji_entry in emoji_data:
    unified = emoji_entry.get('unified')
    category = emoji_entry.get('category')
    
    if unified and category:
        emoji_char = unified_to_emoji_func(unified)
        if emoji_char:
            unified_to_category[emoji_char] = category.capitalize()

print(f"Total emojis in unified_to_category: {len(unified_to_category)}")

Total emojis in unified_to_category: 1903


## Mapping Emojis in the Dataset

Given the merged_df DataFrame, which contains an 'emoji' column, we'll map each emoji to its corresponding category.

In [86]:
# Define emoji_to_category as a copy of unified_to_category
emoji_to_category = unified_to_category.copy()


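def get_emoji_category(emoji_char):
    """Return the category for an emoji character, or 'Unknown' if it has no mapping."""
    return emoji_to_category.get(emoji_char, 'Unknown')


# Map every emoji in merged_df to its category
merged_df['category'] = merged_df['emoji'].apply(get_emoji_category)
print("First 10 entries with 'emoji' and 'category':")
print(merged_df[['emoji', 'category']].head(10))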

First 10 entries with 'emoji' and 'category':
  emoji           category
0     😂  Smileys & emotion
1     ❤            Unknown
2     ♥            Unknown
3     😍  Smileys & emotion
4     😭  Smileys & emotion
5     😘  Smileys & emotion
6     😊  Smileys & emotion
7     👌      People & body
8     💕  Smileys & emotion
9     👏      People & body


## Handling 'Unknown' Emojis
After initial mapping, some emojis may still be categorized as 'Unknown'. We'll identify and attempt to resolve these.

In [87]:
# Identify 'Unknown' emojis in merged_df
unknown_emojis = merged_df[merged_df['category'] == 'Unknown']['emoji'].unique()

print(f"Total 'Unknown' Emojis: {len(unknown_emojis)}")
print("List of 'Unknown' Emojis:")
for emoji_char in unknown_emojis:
    print(emoji_char)

Total 'Unknown' Emojis: 58
List of 'Unknown' Emojis:
❤
♥
☺
✌
☯
☀
❄
✈
✔
➡
✖
☝
▪
☑
▶
☁
⚠
♦
✂
⬆
✏
♣
♨
☠
◀
☹
♠
♻
⬇
☎
⬅
✳
✒
✉
↪
☂
🅰
☮
♂
Ⓜ
🅾
☄
✡
☪
❇
✴
♀
↩
🅱
☃
↙
✍
▫
☢
✝
❣
↕
🈂

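
A plausible cause, which this notebook does not confirm, is that the category mapping is keyed by the fully-qualified form of these characters (base character plus Variation Selector-16, U+FE0F), while the sentiment data stores the bare character. Under that assumption, a lookup that retries with and without the selector would resolve them. `lookup_category` and `toy_mapping` below are hypothetical.

```python
VS16 = '\uFE0F'  # Variation Selector-16 requests emoji presentation


def lookup_category(emoji_char, emoji_to_category):
    """Look up a category, retrying with and without a trailing VS16."""
    candidates = [emoji_char, emoji_char + VS16, emoji_char.rstrip(VS16)]
    for candidate in candidates:
        if candidate in emoji_to_category:
            return emoji_to_category[candidate]
    return 'Unknown'


# Toy mapping keyed by the fully-qualified heart (U+2764 U+FE0F); for illustration only
toy_mapping = {'\u2764\uFE0F': 'Smileys & emotion', '\U0001F602': 'Smileys & emotion'}
bare_heart_category = lookup_category('\u2764', toy_mapping)
```

If the assumption holds for the real `emoji_to_category`, applying such a lookup to `merged_df['emoji']` would shrink the 'Unknown' list without any hand-written labels.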

## Resolve 'Unknown' Emojis Using Existing Mappings
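We try to resolve these 'Unknown' emojis by looking each one up in unified_to_emoji and then re-checking emoji_to_category. Because the category lookup uses the same character that was already missing from emoji_to_category, this step cannot assign any new category; its output only confirms which emojis remain unresolved.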

In [88]:
# Function to resolve 'Unknown' emojis using existing mappings
def resolve_unknown_emojis(unknown_emojis, unified_to_emoji, emoji_to_category):
    """
    Attempts to resolve 'Unknown' emojis by matching them with the unified_to_emoji mapping.
    """
    for emoji_char in unknown_emojis:
        unified_codes = [unified for unified, emo in unified_to_emoji.items() if emo == emoji_char]
        
        if unified_codes:
            unified = unified_codes[0]
            category = emoji_to_category.get(emoji_char, 'Unknown')
            if category != 'Unknown':
                emoji_to_category[emoji_char] = category
                merged_df.loc[merged_df['emoji'] == emoji_char, 'category'] = category
            else:
                print(f"Category for emoji '{emoji_char}' is still 'Unknown'.")
        else:
            print(f"No unified codepoints found for emoji '{emoji_char}'.")

resolve_unknown_emojis(unknown_emojis, unified_to_emoji, emoji_to_category)
merged_df['category'] = merged_df['emoji'].apply(lambda x: emoji_to_category.get(x, 'Unknown'))
remaining_unknowns = merged_df[merged_df['category'] == 'Unknown']['emoji'].unique()

if len(remaining_unknowns) == 0:
    print("All emojis have been successfully categorized.")
else:
    print(f"Remaining 'Unknown' Emojis: {len(remaining_unknowns)}")
    print("List of Remaining 'Unknown' Emojis:")
    for emoji_char in remaining_unknowns:
        print(emoji_char)


Category for emoji '❤' is still 'Unknown'.
Category for emoji '♥' is still 'Unknown'.
Category for emoji '☺' is still 'Unknown'.
Category for emoji '✌' is still 'Unknown'.
Category for emoji '☯' is still 'Unknown'.
Category for emoji '☀' is still 'Unknown'.
Category for emoji '❄' is still 'Unknown'.
Category for emoji '✈' is still 'Unknown'.
Category for emoji '✔' is still 'Unknown'.
Category for emoji '➡' is still 'Unknown'.
Category for emoji '✖' is still 'Unknown'.
Category for emoji '☝' is still 'Unknown'.
Category for emoji '▪' is still 'Unknown'.
Category for emoji '☑' is still 'Unknown'.
Category for emoji '▶' is still 'Unknown'.
Category for emoji '☁' is still 'Unknown'.
Category for emoji '⚠' is still 'Unknown'.
Category for emoji '♦' is still 'Unknown'.
Category for emoji '✂' is still 'Unknown'.
Category for emoji '⬆' is still 'Unknown'.
Category for emoji '✏' is still 'Unknown'.
Category for emoji '♣' is still 'Unknown'.
Category for emoji '♨' is still 'Unknown'.
[output truncated]

## Manual Mapping for Remaining 'Unknown' Emojis
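Some emojis may not be present in the emoji_data or may require manual categorization. We'll create a manual mapping dictionary to handle these cases. The labels are judgment calls and do not always match the Unicode group recorded in the `group` column of merged_df: for example, ♥ is listed there under Activities, and 'Gestures' is not one of the groups in emoji-test.txt.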

In [89]:
# Manual mapping for remaining 'Unknown' emojis
manual_mapping = {
    '❤': 'Symbols',
    '♥': 'Symbols',
    '☺': 'Smileys & Emotion',
    '✌': 'Gestures',
    '☯': 'Symbols',
    '☀': 'Symbols',
    '❄': 'Symbols',
    '✈': 'Travel & Places',
    '✔': 'Symbols',
    '➡': 'Symbols',
    '✖': 'Symbols',
    '☝': 'Gestures',
    '▪': 'Symbols',
    '☑': 'Symbols',
    '▶': 'Symbols',
    '☁': 'Symbols',
    '⚠': 'Symbols',
    '♦': 'Symbols',
    '✂': 'Objects',
    '⬆': 'Symbols',
    '✏': 'Objects',
    '♣': 'Symbols',
    '♨': 'Symbols',
    '☠': 'Symbols',
    '◀': 'Symbols',
    '☹': 'Smileys & Emotion',
    '♠': 'Symbols',
    '♻': 'Symbols',
    '⬇': 'Symbols',
    '☎': 'Objects',
    '⬅': 'Symbols',
    '✳': 'Symbols',
    '✒': 'Objects',
    '✉': 'Objects',
    '↪': 'Symbols',
    '☂': 'Symbols',
    '🅰': 'Symbols',
    '☮': 'Symbols',
    '♂': 'Symbols',
    'Ⓜ': 'Symbols',
    '🅾': 'Symbols',
    '☄': 'Symbols',
    '✡': 'Symbols',
    '☪': 'Symbols',
    '❇': 'Symbols',
    '✴': 'Symbols',
    '♀': 'Symbols',
    '↩': 'Symbols',
    '🅱': 'Symbols',
    '☃': 'Symbols',
    '↙': 'Symbols',
    '✍': 'Gestures',
    '▫': 'Symbols',
    '☢': 'Symbols',
    '✝': 'Symbols',
    '❣': 'Symbols',
    '↕': 'Symbols',
    '🈂': 'Symbols'
}

emoji_to_category.update(manual_mapping)
merged_df['category'] = merged_df['emoji'].apply(lambda x: emoji_to_category.get(x, 'Unknown'))
remaining_unknowns = merged_df[merged_df['category'] == 'Unknown']['emoji'].unique()

if len(remaining_unknowns) == 0:
    print("All emojis have been successfully categorized.")
else:
    print(f"Remaining 'Unknown' Emojis: {len(remaining_unknowns)}")
    print("List of Remaining 'Unknown' Emojis:")
    for emoji_char in remaining_unknowns:
        print(emoji_char)


All emojis have been successfully categorized.
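
An alternative to hand-written labels is to fall back on the `group` column that `merged_df` already carries from `emoji-test.txt`. The sketch below shows the idea on three toy rows whose groups are copied from the merge output earlier in this notebook; it is not what the notebook actually does.

```python
import pandas as pd

# Toy rows mirroring merged_df; 'group' comes from emoji-test.txt
toy = pd.DataFrame({
    'emoji': ['😂', '❤', '♥'],
    'group': ['Smileys & Emotion', 'Smileys & Emotion', 'Activities'],
    'category': ['Smileys & emotion', 'Unknown', 'Unknown'],
})

# Keep existing categories; where the category is 'Unknown', take the Unicode group instead
toy['category'] = toy['category'].mask(toy['category'] == 'Unknown', toy['group'])
print(toy['category'].tolist())
```

This keeps every label traceable to a published source, at the cost of following Unicode's grouping even where a manual label might feel more intuitive.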


## Identify the Inconsistencies

After creating the 'category' column in merged_df, we need to check for inconsistencies in category labels.

In [90]:
# List unique categories before cleaning
print("Unique categories before cleaning:")
print(merged_df['category'].unique())

Unique categories before cleaning:
['Smileys & emotion' 'Symbols' 'People & body' 'Smileys & Emotion'
 'Gestures' 'Activities' 'Animals & nature' 'Objects' 'Travel & places'
 'Food & drink' 'Travel & Places' 'Flags']


## Standardize Category Labels

We'll standardize the category labels by:

 - Converting all labels to title case to handle capitalization inconsistencies.
 - Replacing variations of 'and'/'&' with a standard '&' symbol.
 - Removing leading/trailing whitespace.

In [91]:
import re

def clean_category_label(label):
    label = label.strip()  # Remove leading/trailing whitespace
    label = label.title()  # Convert to title case
    label = label.replace(' And ', ' & ')  # title() has already capitalized any 'and'
    label = re.sub(r'\s*&\s*', ' & ', label)  # Normalize spacing around '&'
    return label

# Apply the function to 'category' column
merged_df['category'] = merged_df['category'].apply(clean_category_label)

## Verify the Standardization

After cleaning, we should verify that the categories are standardized.

In [92]:
# List unique categories after cleaning
print("Unique categories after cleaning:")
print(merged_df['category'].unique())

Unique categories after cleaning:
['Smileys & Emotion' 'Symbols' 'People & Body' 'Gestures' 'Activities'
 'Animals & Nature' 'Objects' 'Travel & Places' 'Food & Drink' 'Flags']
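
Title-casing works for the labels seen here, but it would also alter any label that is not meant to be title case. A different approach, sketched below as an alternative rather than the notebook's method, matches labels case-insensitively against a fixed list of canonical names. The list holds the nine Unicode group names that appear in the output above; `canonical_category` is a hypothetical helper.

```python
CANONICAL_GROUPS = [
    'Smileys & Emotion', 'People & Body', 'Animals & Nature', 'Food & Drink',
    'Travel & Places', 'Activities', 'Objects', 'Symbols', 'Flags',
]
_BY_KEY = {name.casefold(): name for name in CANONICAL_GROUPS}


def canonical_category(label):
    """Map a label to its canonical spelling, ignoring case, spacing, and 'and' vs '&'."""
    # Pad '&' with spaces, collapse runs of whitespace, then compare case-insensitively
    key = ' '.join(label.replace('&', ' & ').split()).casefold()
    key = key.replace(' and ', ' & ')
    return _BY_KEY.get(key, label.strip())


cleaned = [canonical_category(x) for x in ('Smileys & emotion', 'travel and places', ' Food&Drink ')]
```

Labels outside the list, such as 'Gestures', pass through unchanged, which makes non-standard categories easy to spot afterwards.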


## Verification and Final Checks
Finally, we'll verify that all emojis in the dataset are correctly categorized and perform some basic analysis.

### Verify Total Mappings

In [93]:
# Verify the total number of emojis in unified_to_category and emoji_to_category
print(f"Total emojis in unified_to_category: {len(unified_to_category)}")
print(f"Total emojis in emoji_to_category: {len(emoji_to_category)}")

Total emojis in unified_to_category: 1903
Total emojis in emoji_to_category: 1961


In [94]:
# Display a sample of the emoji_category_mapping
sample_category_mapping = dict(list(emoji_category_mapping.items())[:10])
print("Sample Emoji to Category Mapping:")
for emoji_char, cat in sample_category_mapping.items():
    print(f"{emoji_char}: {cat}")

Sample Emoji to Category Mapping:


In [95]:
# Display a sample of the merged_df DataFrame
print("First 10 entries with 'emoji' and 'category':")
print(merged_df[['emoji', 'category']].head(10))

First 10 entries with 'emoji' and 'category':
  emoji           category
0     😂  Smileys & Emotion
1     ❤            Symbols
2     ♥            Symbols
3     😍  Smileys & Emotion
4     😭  Smileys & Emotion
5     😘  Smileys & Emotion
6     😊  Smileys & Emotion
7     👌      People & Body
8     💕  Smileys & Emotion
9     👏      People & Body


## Analyze Emoji Categories
We'll analyze the distribution of emojis across different categories.

In [96]:
# Check the number of unique categories in merged_df
unique_categories = merged_df['category'].unique()
print("Unique Emoji Categories in merged_df:")
print(unique_categories)

Unique Emoji Categories in merged_df:
['Smileys & Emotion' 'Symbols' 'People & Body' 'Gestures' 'Activities'
 'Animals & Nature' 'Objects' 'Travel & Places' 'Food & Drink' 'Flags']


In [97]:
# Count of emojis per category
category_counts = merged_df['category'].value_counts()
print("\nEmoji Counts per Category:")
print(category_counts)


Emoji Counts per Category:
category
Symbols              130
Objects              126
Travel & Places      118
Smileys & Emotion    102
Animals & Nature      82
People & Body         69
Food & Drink          61
Activities            40
Gestures               3
Flags                  3
Name: count, dtype: int64
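
The counts above describe how many emojis fall in each category, not how sentiment is distributed within it. A minimal sketch of that second step follows, using four toy rows whose counts are copied from the sentiment table earlier in this notebook. The variable names `totals` and `shares` are illustrative.

```python
import pandas as pd

# Toy rows: 😂 and 😍 (Smileys & Emotion), ❤ and ♥ (Symbols)
toy = pd.DataFrame({
    'category': ['Smileys & Emotion', 'Symbols', 'Symbols', 'Smileys & Emotion'],
    'negative': [3614, 355, 252, 329],
    'neutral': [4163, 1334, 1942, 1390],
    'positive': [6845, 6361, 4950, 4640],
})

# Sum the counts per category, then divide each row by its total to get shares
totals = toy.groupby('category')[['negative', 'neutral', 'positive']].sum()
shares = totals.div(totals.sum(axis=1), axis=0)
print(shares.round(3))
```

Summing counts before dividing weights each emoji by how often it occurs, so a handful of very frequent emojis can dominate a category's shares. Averaging per-emoji shares instead would weight every emoji equally.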


## Final Verification

Ensure that there are no remaining 'Unknown' emojis.

In [98]:
# Final verification
remaining_unknowns = merged_df[merged_df['category'] == 'Unknown']['emoji'].unique()

if len(remaining_unknowns) == 0:
    print("All emojis have been successfully categorized.")
else:
    print(f"Remaining 'Unknown' Emojis: {len(remaining_unknowns)}")
    print("List of Remaining 'Unknown' Emojis:")
    for emoji_char in remaining_unknowns:
        print(emoji_char)

All emojis have been successfully categorized.


## Finalizing and Saving merged_df

Now that all cleaning, mapping, and categorization steps in data_cleaning.ipynb have been carried out, we save merged_df:

In [99]:
import os
import pandas as pd

# Define the directory to save processed data
processed_dir = '/workspace/EmojiInsight/data/processed/'
os.makedirs(processed_dir, exist_ok=True)
merged_data_path = os.path.join(processed_dir, 'emoji_data_merged.csv')

# Save merged_df to CSV
merged_df.to_csv(merged_data_path, index=False)
print(f"Cleaned merged data saved to {merged_data_path}")


Cleaned merged data saved to /workspace/EmojiInsight/data/processed/emoji_data_merged.csv


## Verifying the Saved File

To confirm that the DataFrame was written to disk, we check that the file exists right after saving:

In [100]:
if os.path.isfile(merged_data_path):
    print(f"File {merged_data_path} exists.")
else:
    print(f"File {merged_data_path} was not found.")

File /workspace/EmojiInsight/data/processed/emoji_data_merged.csv exists.


### Next Steps

With the sentiment data cleaned, merged with the Unicode emoji data, and categorized, we're now ready to proceed with:

- **Exploratory Data Analysis (EDA):** Uncover patterns and insights in the data.
- **Feature Engineering:** Create new features that may enhance our predictive models.
- **Model Development:** Begin building machine learning models to meet our project objectives.