## Installation and imports

In [21]:
!pip install ydata-profiling
!pip install liac-arff

  from pkg_resources import load_entry_point
Collecting ydata-profiling
  Downloading ydata_profiling-4.10.0-py2.py3-none-any.whl (356 kB)
Collecting pydantic>=2
  Downloading pydantic-2.9.2-py3-none-any.whl (434 kB)
Collecting visions[type_image_path]<0.7.7,>=0.7.5
  Downloading visions-0.7.6-py3-none-any.whl (104 kB)
  Ignoring imagehash: markers 'extra == "type-image-path"' don't match your environment
  Ignoring Pillow: markers 'extra == "type-image-path"' don't match your environment
Collecting htmlmin==0.1.12
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
Collecting phik<0.13,>=0.11.1
  Downloading phik-0.12.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (686 kB)
Collecting seaborn<0

In [22]:
import pandas as pd
from ydata_profiling import ProfileReport
import arff

## File conversion

In [17]:
# Read the ARFF file
with open('dataset.arff') as f:
    data = arff.load(f)

# Show the relation (dataset) name
print(f"Relation: {data['relation']}")

# Extract the attribute metadata: a list of (name, type) pairs
attributes = data['attributes']

# Show the metadata
print("Attributes:")
for attr in attributes:
    print(f"  - {attr[0]}: {attr[1]}")

# Convert the data to a DataFrame
df = pd.DataFrame(data['data'], columns=[attr[0] for attr in attributes])

# Save the data to a CSV file
df.to_csv('dataset.csv', index=False)
print("File converted successfully!")

Relation: Human-Memory-and-Cognition
Attributes:
  - AssignmentId: STRING
  - WorkTimeInSeconds: INTEGER
  - WorkerId: STRING
  - annotatorAge: REAL
  - annotatorGender: STRING
  - annotatorRace: STRING
  - distracted: REAL
  - draining: REAL
  - frequency: REAL
  - importance: REAL
  - logTimeSinceEvent: REAL
  - mainEvent: STRING
  - memType: STRING
  - mostSurprising: STRING
  - openness: REAL
  - recAgnPairId: STRING
  - recImgPairId: STRING
  - similarity: REAL
  - similarityReason: STRING
  - story: STRING
  - stressful: REAL
  - summary: STRING
  - timeSinceEvent: REAL
File converted successfully!


## Data profiling

The data come from OpenML, specifically the [Human-Memory-and-Cognition dataset](https://openml.org/search?type=data&id=43596&sort=runs&status=active).
The dataset contains 6854 stories told by people. Each story may be real, imagined, or retold by another person.

Each story comes with information about the storyteller, how they felt about the story, and aspects of the story itself.

This information is organized into 23 attributes, 11 numeric and 12 categorical, listed below.

- AssignmentId: STRING Unique ID identifying each story
- WorkTimeInSeconds: INTEGER Time the annotator took to record the story, in seconds
- WorkerId: STRING Unique ID of each annotator
- annotatorAge: FLOAT Annotator's age, grouped into the buckets 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55+
- annotatorGender: STRING Annotator's gender: man, woman, or other
- annotatorRace: STRING Annotator's race; values seen in the data include white, black, asian, hisp, and other
- distracted: FLOAT How distracted the annotator was while writing the story, Likert scale from 1 to 5
- draining: FLOAT How draining it was to write the story, Likert scale from 1 to 5
- frequency: FLOAT How often the annotator thinks or talks about the story, Likert scale from 1 to 5
- importance: FLOAT How important the story is to the annotator, Likert scale from 1 to 5
- logTimeSinceEvent: FLOAT Natural logarithm of the number of days since the story occurred (the log of timeSinceEvent)
- mainEvent: STRING Sentence describing the main event of the story
- memType: STRING Whether the story was imagined, recalled, or retold
- mostSurprising: STRING Sentence describing the most surprising aspect of the story
- openness: FLOAT Continuous variable from -1 to 1 indicating how comfortable and open the annotator feels talking about the story
- recAgnPairId: STRING ID of the recalled story that corresponds to this retold story (null for imagined stories). Group by this variable to obtain the recalled-retold pairs.
- recImgPairId: STRING ID of the recalled story that corresponds to this imagined story (null for retold stories). Group by this variable to obtain the recalled-imagined pairs.
- similarity: FLOAT How similar the story seemed to the author's own life, Likert scale from 1 to 5
- similarityReason: STRING Explanation for the similarity rating
- story: STRING Full story
- stressful: FLOAT How stressful the story was, Likert scale from 1 to 5
- summary: STRING Summary of the story
- timeSinceEvent: FLOAT Number of days between the event and the telling of the story
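The grouping suggested for `recImgPairId` can be sketched as follows. This is only an illustration: the frame `sample` holds a handful of rows and three columns taken from this dataset, and the names `sample` and `pairs` are assumptions, not part of the original notebook.

```python
import pandas as pd

# A few rows from this dataset, reduced to the columns needed for pairing.
rows = [
    ("32RIADZISTQWI5XIVG5BN0VMYFRS4U", "imagined", "3018Q3ZVOJCZJFDMPSFXATCQ4DARA2"),
    ("3018Q3ZVOJCZJFDMPSFXATCQ4DARA2", "recalled", "3018Q3ZVOJCZJFDMPSFXATCQ4DARA2"),
    ("3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61", "imagined", "3018Q3ZVOJCZJFDMPSFXATCQG04RAI"),
    ("3018Q3ZVOJCZJFDMPSFXATCQG04RAI", "recalled", "3018Q3ZVOJCZJFDMPSFXATCQG04RAI"),
    ("3SKEMFQBZ4RZDN7C2AMMDQKHCV68K1", "recalled", None),  # story without a pair
]
sample = pd.DataFrame(rows, columns=["AssignmentId", "memType", "recImgPairId"])

# groupby skips null keys by default, so unpaired stories are left out.
# Each pair ID maps to {memType: AssignmentId} for the stories in that pair.
pairs = {
    pair_id: dict(zip(group["memType"], group["AssignmentId"]))
    for pair_id, group in sample.groupby("recImgPairId")
}

for pair_id, members in pairs.items():
    print(pair_id, members)
```

The same pattern should apply to `recAgnPairId` for the recalled-retold pairs. On the full dataset, replace `sample` with `df`.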

Note: some attributes use the Likert scale, which consists of:
1 - Strongly disagree
2 - Disagree
3 - Neutral (neither agree nor disagree)
4 - Agree
5 - Strongly agree
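One possible way to turn the numeric Likert codes into readable labels is sketched below. The names `LIKERT_LABELS` and `likert_label` are hypothetical helpers, and the English labels are renderings of the scale described in the note above. The sketch assumes scores are stored as floats, with NaN for missing values.

```python
import math

# English labels for the five points of the scale described above.
LIKERT_LABELS = {
    1: "Strongly disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly agree",
}


def likert_label(score):
    """Map a score stored as a float (e.g. 3.0) to its label.

    Missing values (None or NaN) map to None; anything else outside
    the five valid points raises ValueError.
    """
    if score is None or math.isnan(score):
        return None
    if score not in LIKERT_LABELS:  # 3.0 matches the key 3; 2.5 or 6.0 do not
        raise ValueError(f"not a valid Likert score: {score!r}")
    return LIKERT_LABELS[score]


print([likert_label(s) for s in [1.0, float("nan"), 4.0]])
```

With pandas, the helper could be applied to a column with `df["importance"].map(likert_label)`.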

Some values are missing, mainly in the recImgPairId, recAgnPairId, and similarityReason attributes.
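A quick way to check where values are missing is to count nulls per column. The sketch below uses a small hand-built frame (`sample` is an assumed name, and its values are taken from the first rows of this dataset) rather than the full data.

```python
import pandas as pd

# Four rows with a subset of columns, mirroring the first rows of the dataset.
sample = pd.DataFrame({
    "memType": ["imagined", "recalled", "imagined", "recalled"],
    "frequency": [None, 3.0, None, 3.0],
    "recAgnPairId": [None, None, None, None],
    "similarityReason": [
        "I've been to a couple concerts, but not many.",
        None,
        "I am a mother myself",
        None,
    ],
})

# Number of missing values per column, most affected columns first.
missing = sample.isna().sum().sort_values(ascending=False)

# Fraction of missing values per column (0.0 to 1.0).
missing_share = sample.isna().mean()

print(missing)
print(missing_share)
```

Running the same two lines on `df` gives the counts for all 23 attributes.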

Below you can see the dataset itself and the information produced by the profiler.

In [20]:
df = pd.read_csv("dataset.csv")
pd.options.display.max_columns = 23
df

Unnamed: 0,AssignmentId,WorkTimeInSeconds,WorkerId,annotatorAge,annotatorGender,annotatorRace,distracted,draining,frequency,importance,logTimeSinceEvent,mainEvent,memType,mostSurprising,openness,recAgnPairId,recImgPairId,similarity,similarityReason,story,stressful,summary,timeSinceEvent
0,32RIADZISTQWI5XIVG5BN0VMYFRS4U,1641,XI8VK89S,25.0,man,white,1.0,1.0,,3.0,4.499810,attending a show,imagined,when I got concert tickets,0.000,,3018Q3ZVOJCZJFDMPSFXATCQ4DARA2,3.0,"I've been to a couple concerts, but not many.","Concerts are my most favorite thing, and my bo...",1.0,My boyfriend and I went to a concert together ...,90.0
1,3018Q3ZVOJCZJFDMPSFXATCQ4DARA2,1245,1HN5ZZ1D,25.0,woman,white,1.0,1.0,3.0,4.0,4.499810,a concert.,recalled,we saw the beautiful sky.,1.000,,3018Q3ZVOJCZJFDMPSFXATCQ4DARA2,,,"The day started perfectly, with a great drive ...",1.0,My boyfriend and I went to a concert together ...,90.0
2,3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61,1159,8SBPL7EI,35.0,woman,black,1.0,1.0,,4.0,5.010635,my sister having her twins a little early,imagined,she went into labor early,0.500,,3018Q3ZVOJCZJFDMPSFXATCQG04RAI,3.0,I am a mother myself,It seems just like yesterday but today makes f...,1.0,My sister gave birth to my twin niece and neph...,150.0
3,3018Q3ZVOJCZJFDMPSFXATCQG04RAI,500,M1QQED2V,30.0,woman,white,1.0,4.0,3.0,5.0,5.010635,meeting my twin niece and nephew.,recalled,finding out they were healthy.,1.000,,3018Q3ZVOJCZJFDMPSFXATCQG04RAI,,,"Five months ago, my niece and nephew were born...",2.0,My sister gave birth to my twin niece and neph...,150.0
4,3MTMREQS4W44RBU8OMP3XSK8NMJAWZ,1074,DU3RPZDB,25.0,man,white,2.0,2.0,,3.0,3.401197,the consequences of going to burning man,imagined,When I don't answer the phone in case I owe th...,0.250,,3018Q3ZVOJCZJFDMPSFXATCQG06AR3,4.0,Because I also have money problems,About a month ago I went to burning man. I was...,4.0,It is always a journey for me to go to burning...,30.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6849,3SKEMFQBZ4RZDN7C2AMMDQKHCV68K1,926,KVSO6L8P,30.0,woman,other,3.0,5.0,3.0,5.0,5.010635,losing and finding a pet.,recalled,the kitten ran into my arms.,0.125,,,,,My dog was diagnosed with lymphoma a year ago ...,5.0,"My dog, who had lymphoma, was suffering so I h...",150.0
6850,39PAAFCODNMWRITC4CBO6VRL6O4TV3,3044,QJB7AXPP,18.0,woman,asian,4.0,2.0,4.0,2.0,6.345636,about a vacation event worked on,recalled,when i encountered an guy who was really scared,-0.500,,,,,"Over my vacation from my job, I went to Casper...",5.0,"On vacation, a side job was taken to plan an e...",570.0
6851,3FE2ERCCZYU396R8MJGQ6TWGLSMOPR,1008,IJP8D12L,35.0,man,asian,1.0,2.0,2.0,4.0,3.044522,my nephew's birthday party,recalled,a lot of people got in the pool.,0.500,,,,,This event was a birthday party for my nephew....,2.0,This was a birthday party for my nephew that h...,21.0
6852,3J88R45B2HKQ3F50NA3MP6N9XXKPXS,1462,LCKEHYRF,30.0,man,hisp,1.0,1.0,3.0,3.0,2.639057,my cousin's birthday,recalled,my cousin threw a tantrum in the middle of the...,0.500,,,,,This event occurred about two weeks ago. I was...,2.0,It was my little cousin's birthday and went to...,14.0
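The rows shown above suggest that `logTimeSinceEvent` is the natural logarithm of `timeSinceEvent`. The short check below tests this on value pairs copied from those rows. The name `preview` is an assumption, and the check covers only the displayed rows, not the whole dataset.

```python
import math

# (timeSinceEvent, logTimeSinceEvent) pairs copied from the rows shown above.
preview = [
    (90.0, 4.499810),
    (150.0, 5.010635),
    (30.0, 3.401197),
    (570.0, 6.345636),
    (21.0, 3.044522),
    (14.0, 2.639057),
]

# Largest gap between ln(days) and the stored log value.
max_abs_error = max(abs(math.log(days) - logged) for days, logged in preview)

# The displayed values are rounded to 6 decimals, so a gap below 1e-6 is a match.
print(max_abs_error < 1e-6)
```

If the relationship holds across the full data, the two columns carry the same information and are perfectly rank-correlated, which is worth keeping in mind when reading the profiler's correlation section.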


In [26]:
df_reduced

Unnamed: 0,AssignmentId,WorkTimeInSeconds,WorkerId,annotatorAge,annotatorGender,annotatorRace,distracted,draining,frequency,importance,logTimeSinceEvent,memType,openness,recAgnPairId,recImgPairId,similarity,stressful,timeSinceEvent
0,32RIADZISTQWI5XIVG5BN0VMYFRS4U,1641,XI8VK89S,25.0,man,white,1.0,1.0,,3.0,4.499810,imagined,0.000,,3018Q3ZVOJCZJFDMPSFXATCQ4DARA2,3.0,1.0,90.0
1,3018Q3ZVOJCZJFDMPSFXATCQ4DARA2,1245,1HN5ZZ1D,25.0,woman,white,1.0,1.0,3.0,4.0,4.499810,recalled,1.000,,3018Q3ZVOJCZJFDMPSFXATCQ4DARA2,,1.0,90.0
2,3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61,1159,8SBPL7EI,35.0,woman,black,1.0,1.0,,4.0,5.010635,imagined,0.500,,3018Q3ZVOJCZJFDMPSFXATCQG04RAI,3.0,1.0,150.0
3,3018Q3ZVOJCZJFDMPSFXATCQG04RAI,500,M1QQED2V,30.0,woman,white,1.0,4.0,3.0,5.0,5.010635,recalled,1.000,,3018Q3ZVOJCZJFDMPSFXATCQG04RAI,,2.0,150.0
4,3MTMREQS4W44RBU8OMP3XSK8NMJAWZ,1074,DU3RPZDB,25.0,man,white,2.0,2.0,,3.0,3.401197,imagined,0.250,,3018Q3ZVOJCZJFDMPSFXATCQG06AR3,4.0,4.0,30.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6849,3SKEMFQBZ4RZDN7C2AMMDQKHCV68K1,926,KVSO6L8P,30.0,woman,other,3.0,5.0,3.0,5.0,5.010635,recalled,0.125,,,,5.0,150.0
6850,39PAAFCODNMWRITC4CBO6VRL6O4TV3,3044,QJB7AXPP,18.0,woman,asian,4.0,2.0,4.0,2.0,6.345636,recalled,-0.500,,,,5.0,570.0
6851,3FE2ERCCZYU396R8MJGQ6TWGLSMOPR,1008,IJP8D12L,35.0,man,asian,1.0,2.0,2.0,4.0,3.044522,recalled,0.500,,,,2.0,21.0
6852,3J88R45B2HKQ3F50NA3MP6N9XXKPXS,1462,LCKEHYRF,30.0,man,hisp,1.0,1.0,3.0,3.0,2.639057,recalled,0.500,,,,2.0,14.0


In [35]:
df_reduced = df.drop(columns=['AssignmentId','story', 'summary', 'similarityReason', 'mainEvent','mostSurprising'])

profile = ProfileReport(df, title="Data Profiling Report", explorative=True, minimal=True)
profile.to_notebook_iframe()

In [36]:
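teste = df[['WorkTimeInSeconds', 'openness']]
# ProfileReport expects a DataFrame. Passing the Series df['importance'] raises
# "AttributeError: 'Series' object has no attribute 'rdd'", so select the column
# with double brackets to keep it as a one-column DataFrame.
profile = ProfileReport(df[['importance']], title="Data Profiling Report", explorative=True)
profile.to_notebook_iframe()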