# Character and Emoji Analysis for the Full Dataset

### - Bag of characters (BoC) analysis of Merja's full HS dataset
### - Bag of emojis (BoE) analysis of Merja's full HS dataset


### - Analysis of 'raw' HS samples


### INTRODUCTION

The scripts were created and published by Merja Kreivi-Kauppinen as part of research work carried out at the University of Oulu in 2020-2023. The study is reported in the Master’s thesis: Merja Kreivi-Kauppinen (2024) Hate Speech Detection of Dialectal, Granular and Urban Finnish. University of Oulu, Degree Programme in Computer Science and Engineering. Master’s Thesis.


### DATASET

The dataset of collected and generated research data is not shared or published.


### DATA ANALYSIS

The created dataset was evaluated with BoC and BoE analyses.

Data samples were pre-processed by lowercasing.

Characters and special characters of the ‘raw’ text samples were analysed with the character-level count vectorizer ('CountVectorizer') of the scikit-learn 'feature_extraction' library.

The result lists every character feature found in the created dataset.

The BoC analysis also revealed the emojis present in the data, which make up the Bag-of-Emojis (BoE).
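
To make the idea concrete, here is a minimal, dependency-free sketch of a bag of characters. It uses `collections.Counter` as a stand-in for the scikit-learn vectorizer applied later in this notebook. The helper name `bag_of_characters` and the two toy samples are illustrative assumptions and are not taken from the dataset or the original scripts.

```python
from collections import Counter


def bag_of_characters(samples):
    """Count every character (Unicode code point) over the lowercased samples."""
    counts = Counter()
    for text in samples:
        # Lowercase first, mirroring the pre-processing described above
        counts.update(text.lower())
    return counts


# Toy samples (assumed for illustration only)
samples = ["Joo!", "joo 😂😂"]

boc = bag_of_characters(samples)

# Sorting by code point puts emojis after ASCII characters
vocabulary = sorted(boc)

print(vocabulary)
print({ch: boc[ch] for ch in vocabulary})
```

Because emojis are ordinary characters at this level, they end up in the same vocabulary as letters and punctuation, which is how a character analysis can surface the emojis in a dataset.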


### Check Python and Jupyter installations

In [1]:
import sys, re, os, openpyxl
from tqdm import tqdm

import numpy as np
import pandas as pd

print(f"\nPython Version: {sys.version} \n")
print(sys.executable)
print(sys.version)
print(sys.version_info)
print('\njupyter version: \n')
!jupyter --version



Python Version: 3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)] 

C:\Users\merja\anaconda3\envs\NLPtfgpu\python.exe
3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)]
sys.version_info(major=3, minor=9, micro=13, releaselevel='final', serial=0)

jupyter version: 

Selected Jupyter core packages...
IPython          : 8.5.0
ipykernel        : 6.16.0
ipywidgets       : 8.0.2
jupyter_client   : 7.4.2
jupyter_core     : 4.11.1
jupyter_server   : not installed
jupyterlab       : not installed
nbclient         : 0.7.0
nbconvert        : 7.2.1
nbformat         : 5.7.0
notebook         : 6.4.12
qtconsole        : not installed
traitlets        : 5.4.0


### Import python packages and libraries

In [2]:
# import python packages and libraries

import time, datetime, random, string
import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns
from pylab import rcParams

import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

from sklearn.feature_extraction.text import CountVectorizer

# import libraries for parsers
from pathlib import Path
import logging

# set seaborn figures
%matplotlib inline
rcParams['figure.figsize'] = 8, 4
#sns.set(style='darkgrid', palette='muted', font_scale=1.0)
sns.set(style='darkgrid', palette='Greys', font_scale=1.0)

import emoji
from emoji import *
import functools
import operator

The scikit-learn version is 1.1.2.


## Load Dataset for Analysis

### Load Manually Annotated Collection of All HS Data

In [3]:
# Load labeled HS data from the xlsx file into a pandas DataFrame

cwd = os.getcwd()
folder = '\\data\\'
csv_file = 'Manually_Annotated_Collection_ALL_FINAL_22023Merja.xlsx'
csv_source = cwd + folder + csv_file
#print(csv_source)
df = pd.read_excel(csv_source)
df

Unnamed: 0,id,sample,sentiment,polarity,HSbinary,HSstrength,HStarget,HStopic,HSform,emotion,urban_finnish,correct_finnish,user_nick
0,1,- Ajatus siitä että kaikki henkilön tienaamat...,negative,-1,False,0,,,,"UNPLEASENT, ANTICIPATION CRITICAL, SARCASTIC",- Ajatus siitä että kaikki henkilön tienaamat...,- Ajatus siitä että kaikki henkilön tienaamat...,Meria
1,2,"- Kaivovertauksessa, viime hallitus on kaivan...",negative,-2,False,0,,,,"NEUTRAL NONE, UNPLEASENT, ANTICIPATION CRITICAL","- Kaivovertauksessa, viime hallitus on kaivan...","- Kaivovertauksessa, viime hallitus on kaivan...",Meria
2,3,--´973´¤-.ttu,negative,-5,True,2,NONE,OTHER,"SWEARING, GRANULATED",CONTEMPT DISRESPECT,--´973´¤-.ttu,973 vittu,Meria
3,4,-=>Widdu joo<=-,negative,-4,True,2,NONE,TROLLING,"JOKE SARCASM, SWEARING, GRANULATED",SARCASTIC,-=> Widdu joo <=-,-=> vittu joo <=-,Meria
4,5,"- Ei , mutta olen joutunut elämään katsomalla ...",negative,-5,False,0,,,,"SADNESS, FEAR","- Ei , mutta olen joutunut elämään katsomalla ...","Ei , mutta olen joutunut elämään katsomalla ku...",Meria
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6624,6625,😂😂😂😂😂😂😂😂😂😂😂😂😂😂,positive,5,False,0,,,,JOY,😂😂😂😂😂😂😂😂😂😂😂😂😂😂,😂😂😂😂😂,Meria
6625,6626,😠👉👩‍💼,negative,-3,False,0,,,,ANGER HATE,😠👉👩‍💼,😠👉👩‍💼,Meria
6626,6627,🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣,positive,5,False,0,,,,JOY,🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣,🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣,Meria
6627,6628,🤮😡,negative,-4,False,0,,,,"DISGUST, ANGER HATE",🤮😡,🤮😡,Meria


In [4]:
# Get data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6629 entries, 0 to 6628
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               6629 non-null   int64 
 1   sample           6629 non-null   object
 2   sentiment        6629 non-null   object
 3   polarity         6629 non-null   int64 
 4   HSbinary         6629 non-null   bool  
 5   HSstrength       6629 non-null   int64 
 6   HStarget         4437 non-null   object
 7   HStopic          4437 non-null   object
 8   HSform           4437 non-null   object
 9   emotion          6629 non-null   object
 10  urban_finnish    6629 non-null   object
 11  correct_finnish  6629 non-null   object
 12  user_nick        6629 non-null   object
dtypes: bool(1), int64(3), object(9)
memory usage: 628.1+ KB


In [5]:
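# character vectorizer with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

all_text = df["sample"]

# analyzer="char" with ngram_range=(1,1) treats every single character as a feature;
# CountVectorizer lowercases the text by default (lowercase=True)
character_vectorizer = CountVectorizer(ngram_range=(1,1), analyzer="char")
character_vectorizer.fit(all_text)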

In [6]:
print(character_vectorizer.get_feature_names_out())

['\n' ' ' '!' '"' '#' '%' '&' "'" '(' ')' '*' '+' ',' '-' '.' '/' '0' '1'
 '2' '3' '4' '5' '6' '7' '8' '9' ':' ';' '<' '=' '>' '?' '@' '[' '\\' ']'
 '^' '_' '`' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' '|' '~' '\xa0' '¤' '§' '¨'
 '´' 'µ' 'á' 'ä' 'å' 'í' 'ö' '÷' '\u200d' '‑' '–' '”' '…' '€' '☺' '♥' '✌'
 '✨' '❤' '️' '🇫' '🇮' '🏇' '🏻' '🐎' '🐾' '👉' '👊' '👌' '👍' '👎' '👏' '👓' '👖' '👗'
 '👙' '👚' '👠' '👩' '💎' '💕' '💖' '💗' '💜' '💝' '💞' '💩' '💪' '💼' '🔥' '🖕' '🖤' '😀'
 '😁' '😂' '😃' '😄' '😅' '😆' '😇' '😈' '😉' '😊' '😌' '😍' '😎' '😏' '😑' '😒' '😔' '😖'
 '😗' '😘' '😙' '😚' '😛' '😝' '😠' '😡' '😣' '😨' '😩' '😬' '😭' '😱' '😴' '😵' '🙄' '🙉'
 '🙏' '🚮' '🚯' '🚰' '🚱' '🚻' '🤓' '🤔' '🤗' '🤘' '🤙' '🤡' '🤣' '🤩' '🤮' '🦄' '🦋']


In [7]:
char_features = character_vectorizer.get_feature_names_out()
char_features

array(['\n', ' ', '!', '"', '#', '%', '&', "'", '(', ')', '*', '+', ',',
       '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`',
       'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
       '|', '~', '\xa0', '¤', '§', '¨', '´', 'µ', 'á', 'ä', 'å', 'í', 'ö',
       '÷', '\u200d', '‑', '–', '”', '…', '€', '☺', '♥', '✌', '✨', '❤',
       '️', '🇫', '🇮', '🏇', '🏻', '🐎', '🐾', '👉', '👊', '👌', '👍', '👎', '👏',
       '👓', '👖', '👗', '👙', '👚', '👠', '👩', '💎', '💕', '💖', '💗', '💜', '💝',
       '💞', '💩', '💪', '💼', '🔥', '🖕', '🖤', '😀', '😁', '😂', '😃', '😄', '😅',
       '😆', '😇', '😈', '😉', '😊', '😌', '😍', '😎', '😏', '😑', '😒', '😔', '😖',
       '😗', '😘', '😙', '😚', '😛', '😝', '😠', '😡', '😣', '😨', '😩', '😬', '😭',
       '😱', '😴', '😵', '🙄', '🙉', '🙏', '🚮', '🚯', '🚰', '🚱', '🚻', '🤓', '🤔',
       '🤗', '🤘', '🤙', '🤡', '🤣', '🤩', '🤮', '🦄', '🦋'], dtype=

In [8]:
character_features = " ".join(char_features)
character_features

'\n   ! " # % & \' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ [ \\ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | ~ \xa0 ¤ § ¨ ´ µ á ä å í ö ÷ \u200d ‑ – ” … € ☺ ♥ ✌ ✨ ❤ ️ 🇫 🇮 🏇 🏻 🐎 🐾 👉 👊 👌 👍 👎 👏 👓 👖 👗 👙 👚 👠 👩 💎 💕 💖 💗 💜 💝 💞 💩 💪 💼 🔥 🖕 🖤 😀 😁 😂 😃 😄 😅 😆 😇 😈 😉 😊 😌 😍 😎 😏 😑 😒 😔 😖 😗 😘 😙 😚 😛 😝 😠 😡 😣 😨 😩 😬 😭 😱 😴 😵 🙄 🙉 🙏 🚮 🚯 🚰 🚱 🚻 🤓 🤔 🤗 🤘 🤙 🤡 🤣 🤩 🤮 🦄 🦋'
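
Some entries in the character features are not standalone emojis: '\u200d' (zero width joiner), the variation selector listed after '❤', the regional indicators '🇫' and '🇮', and the skin tone modifier '🏻'. A character-level analyzer works on single Unicode code points, so composite emojis are split into their parts. The sketch below illustrates this with Python's standard `unicodedata` module. The helper name `code_points` and the choice of example sequences are illustrative assumptions; the 👩‍💼 sequence does appear in the sample table above.

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Escapes are used so that each sequence is unambiguous
woman_office_worker = "\U0001F469\u200D\U0001F4BC"  # woman + ZWJ + briefcase
flag_finland = "\U0001F1EB\U0001F1EE"               # regional indicators F + I
red_heart = "\u2764\uFE0F"                          # heart + variation selector

examples = [
    ("woman office worker", woman_office_worker),
    ("flag: Finland", flag_finland),
    ("red heart", red_heart),
]

for label, sequence in examples:
    # len() counts code points, not visible glyphs
    print(label, len(sequence), code_points(sequence))
```

Each of these renders as one glyph but contributes two or three separate character features. An emoji-aware pass, such as the `emoji.analyze` call in the next cell, is needed to recover the composite emojis.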

In [9]:
# Analyse the character features once; adjacent code points that form a
# composite emoji (e.g. the two regional indicators of a flag) are joined
emoji_analyzes = emoji.analyze("".join(char_features))
emoji_analyze = [token.value.emoji for token in emoji_analyzes]
emoji_list = [emoji.demojize(token) for token in emoji_analyze]

# Keep the list-of-lists shape of the result
emoji_analyze_list = [emoji_list]

emoji_analyze_list

[[':smiling_face:',
  ':heart_suit:',
  ':victory_hand:',
  ':sparkles:',
  ':red_heart:',
  ':Finland:',
  ':horse_racing_light_skin_tone:',
  ':horse:',
  ':paw_prints:',
  ':backhand_index_pointing_right:',
  ':oncoming_fist:',
  ':OK_hand:',
  ':thumbs_up:',
  ':thumbs_down:',
  ':clapping_hands:',
  ':glasses:',
  ':jeans:',
  ':dress:',
  ':bikini:',
  ':woman’s_clothes:',
  ':high-heeled_shoe:',
  ':woman:',
  ':gem_stone:',
  ':two_hearts:',
  ':sparkling_heart:',
  ':growing_heart:',
  ':purple_heart:',
  ':heart_with_ribbon:',
  ':revolving_hearts:',
  ':pile_of_poo:',
  ':flexed_biceps:',
  ':briefcase:',
  ':fire:',
  ':middle_finger:',
  ':black_heart:',
  ':grinning_face:',
  ':beaming_face_with_smiling_eyes:',
  ':face_with_tears_of_joy:',
  ':grinning_face_with_big_eyes:',
  ':grinning_face_with_smiling_eyes:',
  ':grinning_face_with_sweat:',
  ':grinning_squinting_face:',
  ':smiling_face_with_halo:',
  ':smiling_face_with_horns:',
  ':winking_face:',
  ':smiling_face_