# Data Cleaning Notebook

## EmojiInsight Project

This notebook covers the data cleaning and preprocessing steps for the **EmojiInsight** project. We parse the raw data files, focusing on the `emoji-test.txt` file published by the Unicode Consortium, to prepare the data for analysis and modeling.

---

### 1. Importing Necessary Libraries

First, we import the libraries needed for data processing: `sys` and `os` for path handling, and `pandas` for tabular data. The `sys.path` adjustment that lets us import custom modules from the `src` directory follows in the next section.

In [4]:
import sys
import os
import pandas as pd

### 2. Adjusting `sys.path`

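Since our notebook is located in the `notebooks/` directory and our custom modules are in the `src/` directory, we need to add the project root (the parent of `notebooks/`) to `sys.path`. This lets us import `src` as a package. Note that `os.path.abspath('..')` is resolved against the current working directory, so this assumes the notebook is run from inside `notebooks/`.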

In [5]:
# Get the absolute path of the parent directory (EmojiInsight)
parent_dir = os.path.abspath('..')

# Add the parent directory to sys.path if it's not already there
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)
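
Because the relative path `'..'` depends on where the kernel was started, a more defensive variant is to walk upward until a directory containing `src` is found. The sketch below illustrates that idea; the helper name `find_project_root` and the choice of `src` as the marker directory are assumptions for illustration, not part of the project.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: `marker` defaults to 'src' because the notebook
    imports from that directory.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' above {start}")


def add_to_sys_path(root: Path) -> None:
    """Prepend `root` to sys.path unless it is already present."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
```

In the notebook this could replace the cell above with `parent_dir = str(find_project_root(Path.cwd()))` followed by `add_to_sys_path(Path(parent_dir))`.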

### 3. Importing the `data_processing` Module

Now we can import our custom data processing module, which contains functions for loading and parsing data.

In [6]:
import src.data_processing as dp

### 4. Parsing the `emoji-test.txt` File

We will use the `parse_emoji_test` function from our `data_processing` module to parse the `emoji-test.txt` file. This file contains comprehensive information about all standardized emojis.

In [7]:
# Define the file path to emoji-test.txt
file_path = os.path.join(parent_dir, 'data', 'raw', 'emoji_data', 'emoji-test.txt')

# Parse the emoji-test.txt file
emoji_df = dp.parse_emoji_test(file_path)
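
The implementation of `dp.parse_emoji_test` is not shown in this notebook. As a rough sketch of what such a parser has to do, the example below handles the layout of `emoji-test.txt`: `# group:` and `# subgroup:` comment lines set the current context, and data lines have the form `<codepoints> ; <status> # <emoji> E<version> <name>`. The function name `parse_emoji_test_lines` and the regular expression are assumptions for illustration; the project's own function may differ.

```python
import re

# Data line layout: "<codepoints> ; <status> # <emoji> E<version> <name>"
LINE_RE = re.compile(
    r"^(?P<codepoints>[0-9A-F ]+?)\s*;\s*(?P<status>[\w-]+)\s*#\s*"
    r"(?P<emoji>\S+)\s+(?P<name>.+)$"
)


def parse_emoji_test_lines(lines):
    """Return one dict per data line, tagged with the current group and subgroup."""
    rows, group, subgroup = [], None, None
    for raw in lines:
        line = raw.strip()
        if line.startswith("# group:"):
            group = line.split(":", 1)[1].strip()
        elif line.startswith("# subgroup:"):
            subgroup = line.split(":", 1)[1].strip()
        elif line and not line.startswith("#"):
            match = LINE_RE.match(line)
            if match:
                rows.append({"group": group, "subgroup": subgroup, **match.groupdict()})
    return rows
```

Passing the result to `pd.DataFrame(rows)` would give columns `group`, `subgroup`, `codepoints`, `status`, `emoji`, and `name`. In this sketch the `name` field keeps the version token (for example `E0.6`) at its start.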

### 5. Saving the Parsed Data

After parsing, we'll save the resulting DataFrame to the `data/processed/` directory for future use in our analysis and modeling.

In [8]:
# Define the output path for the processed data
output_path = os.path.join(parent_dir, 'data', 'processed', 'emoji_list.csv')

# Ensure the processed data directory exists
processed_data_dir = os.path.dirname(output_path)
os.makedirs(processed_data_dir, exist_ok=True)

# Save the DataFrame to the processed data directory
emoji_df.to_csv(output_path, index=False)


### 6. Viewing the Parsed Data

Let's take a look at the first few rows of the parsed data to verify that everything was processed correctly.

In [None]:
emoji_df.head()

### 7. Summary of Parsed Data

We can also get a summary of the DataFrame to understand the structure and content of the parsed data.

In [None]:
emoji_df.info()

### 8. Next Steps

Now that we've successfully parsed and saved the Unicode emoji data, the next steps include:

- **Cleaning and preprocessing** the `Emoji_Sentiment_Data_v1.0.csv` file.
- **Merging** the parsed emoji data with the sentiment data.
- **Exploring** the combined data to gain insights and prepare it for modeling.

---

## 9. Cleaning and Preprocessing the `Emoji_Sentiment_Data_v1.0.csv` File

In this section, we'll load the Emoji Sentiment Data, perform necessary cleaning, and prepare it for merging with the parsed Unicode emoji data.

---

### 9.1. Loading the Dataset

We load the `Emoji_Sentiment_Data_v1.0.csv` file into a pandas DataFrame for analysis.

In [12]:
# Define the file path to the dataset
sentiment_file_path = os.path.join(parent_dir, 'data', 'raw', 'emoji_data', 'Emoji_Sentiment_Data_v1.0.csv')

# Load the dataset
sentiment_df = pd.read_csv(sentiment_file_path)

# Display the first few rows
sentiment_df.head()

|   | Emoji | Unicode codepoint | Occurrences | Position | Negative | Neutral | Positive | Unicode name | Unicode block |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 😂 | 0x1f602 | 14622 | 0.805101 | 3614 | 4163 | 6845 | FACE WITH TEARS OF JOY | Emoticons |
| 1 | ❤ | 0x2764 | 8050 | 0.746943 | 355 | 1334 | 6361 | HEAVY BLACK HEART | Dingbats |
| 2 | ♥ | 0x2665 | 7144 | 0.753806 | 252 | 1942 | 4950 | BLACK HEART SUIT | Miscellaneous Symbols |
| 3 | 😍 | 0x1f60d | 6359 | 0.765292 | 329 | 1390 | 4640 | SMILING FACE WITH HEART-SHAPED EYES | Emoticons |
| 4 | 😭 | 0x1f62d | 5526 | 0.803352 | 2412 | 1218 | 1896 | LOUDLY CRYING FACE | Emoticons |


### 9.2. Exploring the Dataset

Let's explore the dataset to understand its structure, data types, and identify any potential issues.

In [13]:
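# Get basic information about the DataFrame (info() prints directly)
sentiment_df.info()
# Check for missing values (result not rendered: only a cell's last expression is displayed)
sentiment_df.isnull().sum()
# Get summary statistics (rendered as the cell's output)
sentiment_df.describe(include='all')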

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 969 entries, 0 to 968
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Emoji              969 non-null    object 
 1   Unicode codepoint  969 non-null    object 
 2   Occurrences        969 non-null    int64  
 3   Position           969 non-null    float64
 4   Negative           969 non-null    int64  
 5   Neutral            969 non-null    int64  
 6   Positive           969 non-null    int64  
 7   Unicode name       969 non-null    object 
 8   Unicode block      969 non-null    object 
dtypes: float64(1), int64(4), object(4)
memory usage: 68.3+ KB


|   | Emoji | Unicode codepoint | Occurrences | Position | Negative | Neutral | Positive | Unicode name | Unicode block |
|---|---|---|---|---|---|---|---|---|---|
| count | 969 | 969 | 969.0 | 969.0 | 969.0 | 969.0 | 969.0 | 969 | 969 |
| unique | 969 | 969 |  |  |  |  |  | 969 | 26 |
| top | 📔 | 0x1f4d4 |  |  |  |  |  | NOTEBOOK WITH DECORATIVE COVER | Miscellaneous Symbols and Pictographs |
| freq | 1 | 1 |  |  |  |  |  | 1 | 484 |
| mean |  |  | 161.961816 | 0.65498 | 24.873065 | 48.942208 | 88.146543 |  |  |
| std |  |  | 713.675266 | 0.195021 | 153.120845 | 189.329228 | 424.849727 |  |  |
| min |  |  | 1.0 | 0.009615 | 0.0 | 0.0 | 0.0 |  |  |
| 25% |  |  | 5.0 | 0.557252 | 0.0 | 2.0 | 2.0 |  |  |
| 50% |  |  | 18.0 | 0.68 | 2.0 | 7.0 | 7.0 |  |  |
| 75% |  |  | 77.0 | 0.793676 | 8.0 | 28.0 | 36.0 |  |  |
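
In the rows previewed earlier, the three sentiment counts add up to `Occurrences` (for example, 3614 + 4163 + 6845 = 14622). One way to test whether that holds for the whole dataset is the consistency check sketched below. The helper name `check_sentiment_counts` is an assumption for illustration, and whether the relationship holds for every row is something the check would reveal, not something stated by the source.

```python
import pandas as pd


def check_sentiment_counts(df, total="Occurrences",
                           parts=("Negative", "Neutral", "Positive")):
    """Return the rows whose sentiment counts do not sum to the total."""
    mismatch = df[list(parts)].sum(axis=1) != df[total]
    return df[mismatch]


# The first two rows from the preview of the sentiment data
sample = pd.DataFrame({
    "Occurrences": [14622, 8050],
    "Negative": [3614, 355],
    "Neutral": [4163, 1334],
    "Positive": [6845, 6361],
})
print(check_sentiment_counts(sample).empty)  # True: both rows are consistent
```

Running `check_sentiment_counts(sentiment_df)` before the columns are renamed would list any inconsistent rows.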


### 9.3. Data Cleaning

Based on the exploration, we'll perform the following cleaning steps:

- **Rename Columns:** Standardize column names for consistency.
- **Handle Missing Values:** Address any missing data if present.
- **Correct Data Types:** Ensure columns have appropriate data types.
- **Standardize Unicode Codepoints:** Prepare codepoints for merging with the Unicode emoji data.

In [14]:
# Rename columns for consistency
sentiment_df.rename(columns={
    'Emoji': 'emoji',
    'Unicode codepoint': 'codepoint',
    'Occurrences': 'occurrences',
    'Position': 'position',
    'Negative': 'negative',
    'Neutral': 'neutral',
    'Positive': 'positive',
    'Unicode name': 'name',
    'Unicode block': 'block'
}, inplace=True)

In [15]:
# Check for missing values again
sentiment_df.isnull().sum()
# Drop rows with missing values (if appropriate)
# sentiment_df.dropna(inplace=True)

# Alternatively, forward-fill missing values
# (fillna(method='ffill') is deprecated in recent pandas versions; use ffill())
# sentiment_df.ffill(inplace=True)

emoji          0
codepoint      0
occurrences    0
position       0
negative       0
neutral        0
positive       0
name           0
block          0
dtype: int64

In [16]:
# Ensure numeric columns are of numeric data types
numeric_columns = ['occurrences', 'position', 'negative', 'neutral', 'positive']
sentiment_df[numeric_columns] = sentiment_df[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Verify data types
sentiment_df.dtypes

emoji           object
codepoint       object
occurrences      int64
position       float64
negative         int64
neutral          int64
positive         int64
name            object
block           object
dtype: object
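
With `errors='coerce'`, any value that cannot be parsed is silently replaced by `NaN`. The output above shows the columns were already numeric, so nothing was lost here, but for less clean inputs it can help to count how many values a conversion discarded. The sketch below shows one way to do that; the helper name `coerce_numeric_report` is an assumption for illustration.

```python
import pandas as pd


def coerce_numeric_report(df, columns):
    """Convert `columns` with errors='coerce' and count values that became NaN."""
    converted = df[columns].apply(pd.to_numeric, errors="coerce")
    # Values missing after conversion minus values that were already missing
    newly_missing = converted.isna().sum() - df[columns].isna().sum()
    return converted, newly_missing
```

For this notebook, `coerce_numeric_report(sentiment_df, numeric_columns)[1]` would be expected to report zero for every column, given the dtypes shown above.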

In [17]:
# Remove the '0x' prefix (literal match, not a regex) and convert codepoints to uppercase
sentiment_df['codepoint_std'] = sentiment_df['codepoint'].str.replace('0x', '', regex=False).str.upper()
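
The same normalization can be written as a small pure-Python function, which makes it easy to cross-check a standardized codepoint against the emoji character itself. This is a minimal sketch; the names `standardize_codepoint` and `codepoint_of` are assumptions for illustration.

```python
def standardize_codepoint(raw: str) -> str:
    """Strip a leading '0x' (any case) and uppercase the hex digits."""
    text = raw.strip()
    if text.lower().startswith("0x"):
        text = text[2:]
    return text.upper()


def codepoint_of(char: str) -> str:
    """Return the uppercase hex codepoint of a single character."""
    return format(ord(char), "X")


print(standardize_codepoint("0x1f602"))  # 1F602
print(codepoint_of("😂"))                # 1F602
```

Comparing `sentiment_df['emoji'].map(codepoint_of)` with `codepoint_std` would flag any row where the character and its recorded codepoint disagree. This only applies to single-codepoint emoji, since `ord` rejects longer strings.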

### 9.4. Saving the Cleaned Data

After cleaning, we'll save the cleaned DataFrame to the `data/processed/` directory for future use.

In [18]:
# Define the output path for the cleaned data
cleaned_sentiment_path = os.path.join(parent_dir, 'data', 'processed', 'emoji_sentiment_data_cleaned.csv')

# Save the cleaned DataFrame
sentiment_df.to_csv(cleaned_sentiment_path, index=False)

### 9.5. Merging with the Parsed Unicode Emoji Data

To enrich our dataset, we'll merge the cleaned sentiment data with the parsed Unicode emoji data on the standardized codepoints.

In [19]:
# Reload the parsed emoji data from the processed directory
# so that this cell also works in a fresh session
emoji_list_path = os.path.join(parent_dir, 'data', 'processed', 'emoji_list.csv')
emoji_df = pd.read_csv(emoji_list_path)

# Standardize codepoints in the emoji data:
# join multi-codepoint sequences with '-' and convert to uppercase
emoji_df['codepoint_std'] = emoji_df['codepoints'].str.replace(' ', '-', regex=False).str.upper()

# Merge the DataFrames on 'codepoint_std'
merged_df = pd.merge(sentiment_df, emoji_df, on='codepoint_std', how='inner')

# Display the merged DataFrame
merged_df.head()

|   | emoji_x | codepoint | occurrences | position | negative | neutral | positive | name_x | block | codepoint_std | group | subgroup | codepoints | status | emoji_y | name_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 😂 | 0x1f602 | 14622 | 0.805101 | 3614 | 4163 | 6845 | FACE WITH TEARS OF JOY | Emoticons | 1F602 | Smileys & Emotion | face-smiling | 1F602 | fully-qualified | 😂 | E0.6 face with tears of joy |
| 1 | ❤ | 0x2764 | 8050 | 0.746943 | 355 | 1334 | 6361 | HEAVY BLACK HEART | Dingbats | 2764 | Smileys & Emotion | heart | 2764 | unqualified | ❤ | E0.6 red heart |
| 2 | ♥ | 0x2665 | 7144 | 0.753806 | 252 | 1942 | 4950 | BLACK HEART SUIT | Miscellaneous Symbols | 2665 | Activities | game | 2665 | unqualified | ♥ | E0.6 heart suit |
| 3 | 😍 | 0x1f60d | 6359 | 0.765292 | 329 | 1390 | 4640 | SMILING FACE WITH HEART-SHAPED EYES | Emoticons | 1F60D | Smileys & Emotion | face-affection | 1F60D | fully-qualified | 😍 | E0.6 smiling face with heart-eyes |
| 4 | 😭 | 0x1f62d | 5526 | 0.803352 | 2412 | 1218 | 1896 | LOUDLY CRYING FACE | Emoticons | 1F62D | Smileys & Emotion | face-concerned | 1F62D | fully-qualified | 😭 | E0.6 loudly crying face |
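
An inner join silently drops every sentiment row whose codepoint has no counterpart in the Unicode data. To see which rows those are, one option is a left join with pandas' `indicator=True`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below also passes explicit `suffixes` so that overlapping columns get descriptive names instead of `_x` and `_y`. The helper name `merge_report` and the suffix names are assumptions for illustration.

```python
import pandas as pd


def merge_report(left, right, key="codepoint_std"):
    """Left-join on `key` and list the key values that found no match."""
    merged = left.merge(
        right,
        on=key,
        how="left",
        indicator=True,                        # adds the '_merge' column
        suffixes=("_sentiment", "_unicode"),   # hypothetical suffix names
    )
    unmatched = merged.loc[merged["_merge"] == "left_only", key].tolist()
    return merged, unmatched
```

Calling `merge_report(sentiment_df, emoji_df)` would return the full left join together with the list of unmatched codepoints, which is a starting point for deciding whether those rows need a different matching rule.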


### 9.6. Saving the Merged Data

We will save the merged DataFrame for future analysis and modeling.

In [20]:
# Define the output path for the merged data
merged_data_path = os.path.join(parent_dir, 'data', 'processed', 'emoji_data_merged.csv')

# Save the merged DataFrame
merged_df.to_csv(merged_data_path, index=False)

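### 9.7. Verifying the Merged Data

Let's check the structure of the merged data. Because the merge is an inner join, only sentiment rows whose standardized codepoint also appears in the Unicode data are kept, so the merged row count (734) is lower than the 969 rows in the sentiment dataset.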

In [22]:
# Get info about the merged DataFrame
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 734 entries, 0 to 733
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   emoji_x        734 non-null    object 
 1   codepoint      734 non-null    object 
 2   occurrences    734 non-null    int64  
 3   position       734 non-null    float64
 4   negative       734 non-null    int64  
 5   neutral        734 non-null    int64  
 6   positive       734 non-null    int64  
 7   name_x         734 non-null    object 
 8   block          734 non-null    object 
 9   codepoint_std  734 non-null    object 
 10  group          734 non-null    object 
 11  subgroup       734 non-null    object 
 12  codepoints     734 non-null    object 
 13  status         734 non-null    object 
 14  emoji_y        734 non-null    object 
 15  name_y         734 non-null    object 
dtypes: float64(1), int64(4), object(11)
memory usage: 91.9+ KB


In [21]:
# Check for any missing values
merged_df.isnull().sum()

emoji_x          0
codepoint        0
occurrences      0
position         0
negative         0
neutral          0
positive         0
name_x           0
block            0
codepoint_std    0
group            0
subgroup         0
codepoints       0
status           0
emoji_y          0
name_y           0
dtype: int64
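
The `name_y` values in the merged preview start with a version token such as `E0.6`, which comes from the emoji version field in `emoji-test.txt`. If a clean name column is wanted, the token can be split off with a regular expression. This is a minimal sketch; the helper name `split_version` and the idea of storing the version separately are assumptions for illustration.

```python
import re

# Matches a leading emoji version token such as "E0.6 " or "E13.1 "
VERSION_PREFIX = re.compile(r"^E\d+\.\d+\s+")


def split_version(name: str):
    """Split 'E0.6 red heart' into ('E0.6', 'red heart')."""
    match = VERSION_PREFIX.match(name)
    if not match:
        return None, name
    return match.group().strip(), name[match.end():]


print(split_version("E0.6 red heart"))  # ('E0.6', 'red heart')
```

Applied with `merged_df['name_y'].map(split_version)`, this yields one `(version, name)` tuple per row, which could then be unpacked into two columns.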

### 9.8. Next Steps

With the sentiment data cleaned and merged with the Unicode emoji data, we're now ready to proceed with:

- **Exploratory Data Analysis (EDA):** Uncover patterns and insights in the data.
- **Feature Engineering:** Create new features that may enhance our predictive models.
- **Model Development:** Begin building machine learning models to meet our project objectives.

---