# Dataset Exploration

In this notebook we explore the dataset to determine which preprocessing steps are required.

In [1]:
from pathlib import Path
import sys
sys.path.append("../")

from config import Config
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
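# The 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6; use whichever is available.
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')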
params = {
    'figure.figsize': (15, 5),
    'legend.fontsize': 'x-large',
    'axes.labelsize': 'x-large',
    'axes.titlesize': 'x-large',
}
plt.rcParams.update(params)

In [2]:
df = pd.read_csv(Config.TRAINING_DATASET_PATH)
df

,text,irony,sarcasm
0,"Zurigo, trovato morto il presunto autore della...",0,0
1,"Zurigo, trovato morto il presunto autore della...",0,0
2,"Zingari..i soliti ""MERDOSI""..#cacciamolivia Ro...",0,0
3,"Zingari di merda,tutti al muro...bastardi Spar...",0,0
4,zero notizie decreto #tfaordinario II ciclo ze...,1,0
...,...,...,...
3972,Casini:Trovare un'intesa tra forze politiche o...,0,0
3973,Cambiare tutto per non cambiare niente sembra ...,0,0
3974,Alcuni mettono mani nelle tasche degli italian...,0,0
3975,A parte che la dieta di #Salvini dovrebbe ess...,1,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3977 entries, 0 to 3976
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     3977 non-null   object
 1   irony    3977 non-null   int64 
 2   sarcasm  3977 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 93.3+ KB


All three columns have 3977 non-null entries, so there is no missing data.
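
`info()` only counts nulls, so a blank string or an out-of-range label would still be reported as non-null. The following is a minimal sketch of a few extra checks, run on a small made-up frame (`toy` and its rows are hypothetical and not taken from the dataset); the same expressions could be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical rows, not from the dataset).
toy = pd.DataFrame({
    "text": ["un tweet", "   ", "altro tweet"],
    "irony": [1, 0, 0],
    "sarcasm": [1, 0, 0],
})

# Null count per column; all zeros means no missing values.
null_counts = toy.isna().sum()

# Rows whose text is empty or whitespace only, which info() counts as non-null.
blank_text = toy["text"].str.strip().eq("")

# Rows with a label other than 0 or 1 in either label column.
bad_labels = ~toy[["irony", "sarcasm"]].isin([0, 1]).all(axis=1)

print(null_counts.to_dict())
print("blank texts:", int(blank_text.sum()))
print("rows with invalid labels:", int(bad_labels.sum()))
```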

In [4]:
text, irony, sarcasm = df['text'], df['irony'], df['sarcasm']

In [5]:
text[0]

'Zurigo, trovato morto il presunto autore della sparatoria nel centro islamico #20dicembre https://t.co/rBjvUi8RJ2'

In [6]:
from preprocessing.frameutils import extract_symbols
from prettyprinting import prettyprint_list

In [7]:
symbols = extract_symbols(text)
prettyprint_list(symbols)

س 🇸 u W ì " ل ◀ 😏 ✌ 
⛪ M Y е 🤣 😒 s ~ X 9 
❓ о ‍ g n 😶 é 🙏 À 😜 
� 🤔 + م o 😔 i 6 z # 
H T 😭 r ) \ 😎 😹 V · 
E 🐐 R c 4 😷 . 5 ” 😃 
🤓   8 / ė $ O ò 0 = 
& І U ✔ » 😄 ê j 1 è 
😇 🇬 D 🤐 😀 – K f 🤦 🤕 
😡 t ° a 😁 F 🇮 🙂 b ️ 
ª … 👏 ! v 😳 I _ ó ا 
☺ 2 k | € ü G 💣 ‘ 🙈 
p y ] , á L > 🔹 h ù 
і 📍 😬 A ▶ 🇹 👍 ♂ 😋 - 
💡 💖 7 😑 : 😟 😉 í 😅 3 
B É Z 😨 “ ; < ̀ 😂 N 
  💪 [ à 💰 w 🇾 ^ 💥 @ 
😊 Ù 😞 J * ë m ’ e а 
💩 P ú 😲 « 🏼 ' x 😕 l 
È ( q 😈 🌹 d 😱 C 🙄 ? 
🇧 S Q % 👎 

As we can see, the tweets contain non-alphanumeric and non-ASCII symbols, such as emoji, which need to be handled during preprocessing.
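
The source of `extract_symbols` is not shown in this notebook. As a rough sketch of how such a character inventory might be built, the hypothetical helper below (`distinct_characters`, not the author's implementation) collects every distinct character from an iterable of strings and then keeps the non-alphanumeric ones.

```python
def distinct_characters(texts):
    """Return the set of distinct characters across an iterable of strings."""
    chars = set()
    for t in texts:
        chars.update(t)  # a string is iterated character by character
    return chars


# Two made-up example strings (not from the dataset).
sample = ["Ciao \U0001F602", "perch\u00e9?"]
chars = distinct_characters(sample)

# Accented letters count as alphanumeric; spaces, punctuation and emoji do not.
non_alnum = sorted(c for c in chars if not c.isalnum())
print(non_alnum)
```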

In [8]:
prettyprint_list([s for s in symbols if not s.isascii()])

س 🇸 ì ل ◀ 😏 ✌ ⛪ е 🤣 
😒 ❓ о ‍ 😶 é 🙏 À 😜 � 
🤔 م 😔 😭 😎 😹 · 🐐 😷 ” 
😃 🤓   ė ò І ✔ » 😄 ê 
è 😇 🇬 🤐 😀 – 🤦 🤕 😡 ° 
😁 🇮 🙂 ️ ª … 👏 😳 ó ا 
☺ € ü 💣 ‘ 🙈 á 🔹 ù і 
📍 😬 ▶ 🇹 👍 ♂ 😋 💡 💖 😑 
😟 😉 í 😅 É 😨 “ ̀ 😂 💪 
à 💰 🇾 💥 😊 Ù 😞 ë ’ а 
💩 ú 😲 « 🏼 😕 È 😈 🌹 😱 
🙄 🇧 👎 
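
The non-ASCII symbols above mix accented letters, typographic punctuation, emoji, and invisible characters such as the zero-width joiner, and each kind probably calls for a different treatment. One way to separate them is by Unicode general category, using the standard `unicodedata` module. The sketch below assumes each symbol is a single character, which appears to be the case here; `group_by_category` is a hypothetical helper.

```python
import unicodedata
from collections import defaultdict


def group_by_category(symbols):
    """Group single-character strings by their Unicode general category."""
    groups = defaultdict(list)
    for s in symbols:
        groups[unicodedata.category(s)].append(s)
    return dict(groups)


# A few characters of the kinds that appear in the output above.
sample = [
    "\u00ec",      # ì, lowercase letter
    "\U0001F602",  # face with tears of joy emoji
    "\u20ac",      # euro sign
    "\u201c",      # left double quotation mark
    "\u200d",      # zero-width joiner (invisible)
    "\ufe0f",      # variation selector-16 (invisible)
]
groups = group_by_category(sample)
for category, members in sorted(groups.items()):
    print(category, [f"U+{ord(m):04X}" for m in members])
```

Categories starting with `L` are letters, `S` symbols (emoji fall under `So`), `P` punctuation, `M` combining marks, and `Cf` format characters.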

Number of tweets classified as ironic or sarcastic:

In [29]:
total = len(text)
ironic = sum(irony)
sarcastic = sum(sarcasm)
print(f"Total tweets \t= {total} ({total/total*100:.2f}%)")
print(f"Ironic \t\t= {ironic} ({ironic/total*100:.2f}%)")
print(f"Non-ironic \t= {total - ironic} ({(total - ironic)/total*100:.2f}%)")
print(f"Sarcastic \t= {sarcastic}  ({sarcastic/total*100:.2f}%)")
print(f"Non-sarcastic \t= {total - sarcastic} ({(total-sarcastic)/total*100:.2f}%)")

Total tweets 	= 3977 (100.00%)
Ironic 		= 2023 (50.87%)
Non-ironic 	= 1954 (49.13%)
Sarcastic 	= 913  (22.96%)
Non-sarcastic 	= 3064 (77.04%)
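
Since both labels are 0/1 columns, the same counts and percentages can also be obtained directly from pandas. The sketch below uses a small made-up label frame (`labels` is hypothetical); on the real data one would presumably pass `df[["irony", "sarcasm"]]` instead.

```python
import pandas as pd

# Made-up labels for illustration only.
labels = pd.DataFrame({
    "irony": [1, 1, 0, 1],
    "sarcasm": [1, 0, 0, 0],
})

positive = labels.sum()  # number of 1s per column
summary = pd.DataFrame({
    "positive": positive,
    "negative": len(labels) - positive,
    # The mean of a 0/1 column is the fraction of positive examples.
    "positive_pct": (labels.mean() * 100).round(2),
})
print(summary)
```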


Number of ironic tweets that are not sarcastic:

In [31]:
len(df[(df["sarcasm"] == 0) & (df['irony'] == 1)].index)

1110

Are all sarcastic tweets also classified as ironic?

In [33]:
len(df[(df["sarcasm"] == 1) & (df['irony'] == 1)].index) == sarcastic

True
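
The two checks above can be read off a single contingency table. A minimal sketch with `pd.crosstab`, again on made-up labels (`labels` is hypothetical): if every sarcastic tweet is also ironic, the cell for `irony == 0` and `sarcasm == 1` is zero.

```python
import pandas as pd

# Made-up labels for illustration only.
labels = pd.DataFrame({
    "irony": [1, 1, 0, 1, 0],
    "sarcasm": [1, 0, 0, 0, 0],
})

# Rows are irony values, columns are sarcasm values.
table = pd.crosstab(labels["irony"], labels["sarcasm"])
print(table)

# Sarcasm implies irony when no row is sarcastic but not ironic.
sarcasm_implies_irony = not ((labels["sarcasm"] == 1) & (labels["irony"] == 0)).any()
print(sarcasm_implies_irony)
```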