### data description

The data is Hanga's own complete Facebook Messenger history, from 2010 to 2020-04-18. You will not see all of it, though; the restrictions are:

- you cannot see the messages themselves, only their length in characters
- everyone's name has been randomly replaced with a celebrity's name (drawn from the Forbes Celebrity Top 100, Oscar-winning and Oscar-nominated actors, and the top 100 fictional characters)
- the renaming is consistent, so a given celebrity always denotes the same person (even if that person changed their name in the chat)
- photos, videos, stickers and GIFs are not included either, only how many of each the sender sent in a given message

### format

The data is in *JSON* format; loaded into Python it becomes a list of dictionaries. Each dictionary describes one event in a chat, which can be one of the following:
- someone sent a message
- someone shared some content
- someone was added to the conversation, or was removed from it / left it

Each dictionary has the following 19 keys (the sample record further down also carries an ```ip``` key, which is `None` there):

- ```type```: one of four values
    - *Generic*: the dictionary describes a plain message
    - *Share*: some content was shared
    - *Subscribe*: someone added someone to the conversation
    - *Unsubscribe*: someone removed someone from the conversation
- ```sender_name```: *string*, the alias of the person who sent the message, shared the content, or added/removed someone
- ```datetime```/```year```/```month```/```day```/```hour```/```minute```: the time of the event
- ```timestamp_ms```: the number of milliseconds elapsed since 1 January 1970 (handy, for example, for computing the time between two events, and great for comparisons)
- ```content_l```: *float*, the number of characters sent in the message
- ```gifs```/```videos```/```photos```/```sticker```: the number of items of the named content type in the message
- ```reactions```: a list of the people who reacted to the message
- ```users```: a list of the people who were added to or removed from the conversation (only relevant when ```type``` is *Subscribe* or *Unsubscribe*)
- ```thread_path```: the identifier of the thread (conversation) in which the event took place
- ```thread_type```: the kind of thread, either *RegularGroup* (group chat) or *Regular* (one-to-one)
- ```index```: the index of the event (not unique, not even within a year; see the warnings below)
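To illustrate why ```timestamp_ms``` is convenient, here is a minimal sketch that converts it to a timezone-aware `datetime` and computes the time between two events. The two records are made-up stand-ins that carry only the relevant key, and `to_datetime` is a hypothetical helper name, not part of the notebook. The first timestamp is the one from the sample record shown later; note that it converts to 16:22:46 UTC while that record's ```datetime``` string says 17:22:46, which suggests (this is an inference, not something the data description states) that the ```datetime``` fields hold local time.

```python
from datetime import datetime, timezone

def to_datetime(timestamp_ms):
    """Convert milliseconds since 1970-01-01 (UTC) to an aware datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)

# Made-up minimal records; real ones carry many more keys.
first = {"timestamp_ms": 1289492566000}
second = {"timestamp_ms": 1289496166000}

# Plain subtraction gives the elapsed time in milliseconds.
elapsed_ms = second["timestamp_ms"] - first["timestamp_ms"]

print(to_datetime(first["timestamp_ms"]).isoformat())  # 2010-11-11T16:22:46+00:00
print(elapsed_ms / 1000 / 60, "minutes between the two events")  # 60.0 minutes
```

Because the timestamps are plain integers, comparisons such as `msg['timestamp_ms'] <= limit` need no date parsing at all.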

### warnings

- **The data is not in chronological order!** If you want to solve a task where ordering helps, you have to sort it yourselves.
- Also be careful: **the index is not unique, not even within a year!**
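Both warnings can be checked in a few lines. The sketch below uses a made-up three-event list in place of the real data: it sorts the events by ```timestamp_ms``` and counts how often each (year, index) pair occurs. The names `events`, `chronological` and `duplicates` are illustrative only.

```python
from collections import Counter

# Made-up stand-in for the loaded list of event dictionaries.
events = [
    {"index": 2, "year": 2010, "timestamp_ms": 3000},
    {"index": 1, "year": 2010, "timestamp_ms": 1000},
    {"index": 2, "year": 2010, "timestamp_ms": 2000},
]

# Chronological order: sort on the millisecond timestamp.
chronological = sorted(events, key=lambda ev: ev["timestamp_ms"])

# Count (year, index) pairs to see whether index is unique within a year.
pair_counts = Counter((ev["year"], ev["index"]) for ev in events)
duplicates = {pair: n for pair, n in pair_counts.items() if n > 1}

print([ev["timestamp_ms"] for ev in chronological])  # [1000, 2000, 3000]
print(duplicates)  # {(2010, 2): 2}
```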

### loading

The data is split into one file per year. Below is a tidy way to read these files into a Python list. The `glob` module lists every file matching the given path pattern (`*` stands for any sequence of characters). The `with open(path, "r") as fp` construct opens the file at that path and closes it again when the block ends; inside it, `json.load(fp)` reads the contents into a variable. The file to read is passed to `get_data` as a parameter.

In [570]:
import glob
import json

In [6]:
message_files = sorted(glob.glob("data/*-msg.json"))

In [7]:
message_files

['data\\2010-msg.json',
 'data\\2011-msg.json',
 'data\\2012-msg.json',
 'data\\2013-msg.json',
 'data\\2014-msg.json',
 'data\\2015-msg.json',
 'data\\2016-msg.json',
 'data\\2017-msg.json',
 'data\\2018-msg.json',
 'data\\2019-msg.json',
 'data\\2020-msg.json']

In [8]:
def get_data(path):

    with open(path, "r") as fp:
        file = json.load(fp)

    return file

In [9]:
list_of_dicts_2010 = get_data(message_files[0])

In [13]:
list_of_dicts_2010[0]

{'index': 1797,
 'sender_name': 'Colin Firth',
 'timestamp_ms': 1289492566000,
 'type': 'Generic',
 'thread_path': 622,
 'thread_type': 'Regular',
 'reactions': [],
 'sticker': None,
 'ip': None,
 'photos': 0,
 'users': [],
 'gifs': 0,
 'videos': 0,
 'content_l': 8,
 'datetime': '2010-11-11T17:22:46.000Z',
 'year': 2010,
 'month': 11,
 'day': 11,
 'hour': 17,
 'minute': 22}

------

### plotting

- if you want to plot something, we recommend the `matplotlib` or the `seaborn` package

#### [matplotlib documentation](https://matplotlib.org/3.3.3/contents.html)
#### [pretty plots with seaborn: Python Graph Gallery](https://python-graph-gallery.com/)

-------------

#### !!! General policy: whenever you have to find the first of something (or the top 5 for the bonus) and several items tie for first place, give the one that comes first in alphabetical order as the solution !!!
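One possible way to implement this tie-breaking rule is to sort on a compound key: descending score first, then name. The sketch below uses made-up scores, and `top_n` is a hypothetical helper. The solutions in this notebook reach the same result differently: they sort alphabetically first and then rely on `max` returning the first maximal element and on `sorted` being stable.

```python
# Made-up scores; "Batman" and "Alice" tie for first place.
scores = {"Zorro": 3, "Batman": 7, "Alice": 7, "Clark": 5}

def top_n(score_by_name, n=1):
    """Return the n best names, highest score first, ties broken alphabetically."""
    ranked = sorted(score_by_name.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

print(top_n(scores))       # ['Alice']
print(top_n(scores, n=3))  # ['Alice', 'Batman', 'Clark']
```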

### explore tasks

1. What is Hanga's alias? (1 point)
2. What is the identifier of jeszk-moments? (2 points)
3. Who writes the longest messages on average? (2 points)
4. Who unsubscribed from jeszk-moments the most times? (2 points)
5. Who sent the most photos to jeszk-moments? (2 points)
6. How many people sent a message in every year? (3 points)
7. How many people sent messages in exactly n years ($n = 1, \dots, 11$)? (2 points, +1 for a plot)
8. Who is in the second-largest number of chats? (3 points)
9. Which one is the 2019 DB chat? (Hanga was a DB member then) (5 points)
10. Who wrote the fewest messages in the 2019 DB chat? (3 points)
11. Which chat has the longest time between two MESSAGES, and how long is that time? (4 points)
12. How many people has Hanga talked to in every year since 2015, and how many in every year since 2016 (the year she joined Rajk), and based on this, who on this list are the Rajk members? (4 points)
13. Who got the most reactions on average in jeszk-moments? (4 points)

### function tasks

14. Given a timestamp, who is the author of the longest message written up to that point in time? (1 point)
15. Given a timestamp, which hour is the least active up to that point, among the hours with at least one interaction (i.e. the activity is not 0)? (1 point)
16. Given a timestamp and a person, tell how many characters they had sent up to that point (2 points)
17. Given a timestamp, how many people had written at least 10 messages up to that point? (3 points)
18. Given a timestamp, which were the top 5 most active chats up to that point? (4 points)
19. Given a timestamp and a chat (thread_id), how many different people wrote in it, and how many messages and how many characters did they write in total up to that point? (4 points)
    - for +2 points: the number of messages and characters sent, grouped by person
20. Given an hour, how many messages were sent in that hour on average, averaged over all the days on which messages were sent? (5 points)
21. Given a timestamp, who wrote the most characters up to that point, and into which chat? (5 points)
22. Given a timestamp, what was the longest period without a message up to that point? (7 points)

### mandatory extra

- every team must submit at least 2 new tasks

### bonus

- wherever a top 1 is asked for, give the top 5 (+1 point)
- any team may publicly submit further tasks beyond the two, up to the agreed deadline; we will say how many points each is worth
- if you solve a task that no other team could solve (+1 point)

Loading the data

In [90]:
data = []
for year in message_files:
    data.extend(get_data(year))

# What is Hanga's alias? (1 point)

In [92]:
regular = [msg for msg in data if msg['thread_type'] == 'Regular']

In [93]:
print(set([msg['type'] for msg in regular]))

{'Generic', 'Share'}


In [94]:
direct_msg = {msg['thread_path']: set() for msg in regular}

In [95]:
for msg in regular:
    direct_msg[msg['thread_path']].add(msg['sender_name'])

In [96]:
all([msg.issuperset({'Colin Firth'}) for msg in direct_msg.values() if len(msg) == 2])

True

In [176]:
HANGA = 'Colin Firth'

Since the `all(...)` check above returns True (every two-person thread contains 'Colin Firth'), Hanga was given the alias Colin Firth.
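The check above verifies a candidate alias but does not show how to find one. A minimal sketch of one way to do that, assuming the owner of the export is the person who appears in the largest number of one-to-one threads: count, for every alias, the number of threads it occurs in. The events and the aliases `Owner` and `Friend A/B/C` are made up for illustration.

```python
from collections import Counter

# Made-up one-to-one events; real records carry many more keys.
regular_events = [
    {"thread_path": 1, "sender_name": "Owner"},
    {"thread_path": 1, "sender_name": "Friend A"},
    {"thread_path": 2, "sender_name": "Friend B"},
    {"thread_path": 2, "sender_name": "Owner"},
    {"thread_path": 3, "sender_name": "Friend C"},
]

# Collect the distinct senders of each thread.
senders_by_thread = {}
for ev in regular_events:
    senders_by_thread.setdefault(ev["thread_path"], set()).add(ev["sender_name"])

# Count the number of threads each alias appears in.
thread_counts = Counter(
    name for senders in senders_by_thread.values() for name in senders
)
owner, n_threads = thread_counts.most_common(1)[0]
print(owner, n_threads)  # Owner 2
```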

# What is the identifier of jeszk-moments? (2 points)

In [140]:
hanga_msg = [msg for msg in data if msg['sender_name'] == 'Colin Firth']

In [143]:
[msg for msg in hanga_msg if msg['year'] == 2019 and msg['month'] == 9 and msg['day'] == 19 and msg['hour'] == 21]

[{'index': 6279,
  'sender_name': 'Colin Firth',
  'timestamp_ms': 1568923173885,
  'type': 'Generic',
  'thread_path': 494,
  'thread_type': 'RegularGroup',
  'reactions': [],
  'sticker': None,
  'ip': None,
  'photos': 0,
  'users': [],
  'gifs': 0,
  'videos': 0,
  'content_l': 45,
  'datetime': '2019-09-19T21:59:33.885Z',
  'year': 2019,
  'month': 9,
  'day': 19,
  'hour': 21,
  'minute': 59}]

In [145]:
JESZK = [msg for msg in hanga_msg if msg['year'] == 2019 and msg['month'] == 9 and msg['day'] == 19 and msg['hour'] == 21][0]['thread_path']

In [147]:
JESZK

494

# Who writes the longest messages on average? (2 points)

In [732]:
users = {msg['sender_name']: [] for msg in data if msg['type'] == 'Generic'}

In [733]:
for msg in data:
    if msg['type'] == 'Generic':
        users[msg['sender_name']].append(msg['content_l'])

In [734]:
mean = lambda ls: sum(ls)/len(ls)

In [735]:
s_users = sorted(users.items(), key = lambda x: x[0])

In [736]:
max(s_users, key = lambda x: mean(x[1]))

('Hugh Jackman', [1323])

In [739]:
sorted(s_users, key = lambda x: mean(x[1]), reverse = True)[:5]

[('Hugh Jackman', [1323]),
 ('James Bond ', [1499, 426, 1499, 5]),
 ('Stanley Tucci', [1219, 11]),
 ('Sophia Loren', [533]),
 ('Barry Fitzgerald', [426])]
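The same per-sender average can be sketched more compactly with `collections.defaultdict` and `statistics.fmean`. The four messages below are made up, and the tie-breaking by name follows the general policy stated earlier; this is an illustrative alternative, not the solution used above.

```python
from collections import defaultdict
from statistics import fmean

# Made-up messages; only Generic events count as messages here.
messages = [
    {"type": "Generic", "sender_name": "A", "content_l": 10},
    {"type": "Generic", "sender_name": "A", "content_l": 30},
    {"type": "Generic", "sender_name": "B", "content_l": 25},
    {"type": "Share", "sender_name": "B", "content_l": 500},
]

# Group message lengths by sender.
lengths = defaultdict(list)
for msg in messages:
    if msg["type"] == "Generic":
        lengths[msg["sender_name"]].append(msg["content_l"])

# Average per sender; the highest average wins, ties go to the alphabetically first name.
averages = {name: fmean(vals) for name, vals in lengths.items()}
best = min(averages, key=lambda name: (-averages[name], name))
print(best, averages[best])  # B 25.0
```

As the top 5 above shows, senders with a single long message dominate such a ranking, so the result is sensitive to how few messages a person has sent.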

# Who unsubscribed from jeszk-moments the most times? (2 points)

In [624]:
jeszk_msg = [msg for msg in data if msg['thread_path'] == 494]
# 494 was worked out by a teammate (it is the jeszk-moments thread id found in the previous task)

In [625]:
jeszk_uns = [msg for msg in jeszk_msg if msg['type'] == 'Unsubscribe']

In [626]:
users = {msg['sender_name']: [] for msg in data if msg['type'] == 'Unsubscribe'}

In [627]:
for msg in jeszk_uns:
    users[msg['sender_name']].append(1)

In [628]:
i_users = sorted(users.items(), key = lambda x: x[0])

In [629]:
max(i_users, key = lambda x: sum(x[1]))

('John Malkovich', [1, 1, 1])

In [631]:
sorted(i_users, key = lambda x: sum(x[1]), reverse = True)[:5]

[('John Malkovich', [1, 1, 1]),
 ('Mahershala Ali', [1, 1, 1]),
 ('Batman', [1, 1]),
 ('Chris Cooper', [1, 1]),
 ('Denzel Washington', [1, 1])]

# Who sent the most photos to jeszk-moments? (2 points)

In [634]:
jeszk_msg = [msg for msg in data if msg['thread_path'] == JESZK]

In [635]:
jeszk_img = [msg for msg in jeszk_msg if msg['photos'] != 0]

In [636]:
users = {msg['sender_name']: 0 for msg in data if msg['type'] == 'Generic'}

In [637]:
for msg in jeszk_img:
    users[msg['sender_name']] += msg['photos']

In [638]:
s_users = sorted(users.items(), key = lambda x: x[0])

In [639]:
max(s_users, key = lambda x: x[1])[0]

'Judy Holliday'

In [641]:
sorted(s_users, key = lambda x: x[1], reverse = True)[:5]

[('Judy Holliday', 168),
 ('Olivia de Havilland', 88),
 ('Michael Douglas', 84),
 ('Naomi Watts', 62),
 ('Holly Golightly ', 58)]

# How many people sent a message in every year? (3 points)

Does every kind of event count here, or only Generic messages?

In [121]:
users = {msg['sender_name']: set() for msg in data if msg['type'] == 'Generic'}

In [123]:
for msg in data:
    if msg['type'] == 'Generic':
        users[msg['sender_name']].add(msg['year'])

In [128]:
all_year = set([msg['year'] for msg in data])
print(all_year)

{2016, 2017, 2018, 2019, 2020, 2010, 2011, 2012, 2013, 2014, 2015}


In [137]:
[user for user, years in users.items() if years == all_year]

['Colin Firth', 'Juliette Lewis', 'U2']

# How many people sent messages in exactly n years ($n = 1, \dots, 11$)? (2 points, +1 for a plot)

In [159]:
users = {msg['sender_name']: set() for msg in data if msg['type'] == 'Generic'}

In [160]:
for msg in data:
    if msg['type'] == 'Generic':
        users[msg['sender_name']].add(msg['year'])

In [164]:
for n in range(1, 12):
    print(n, 'year(s):', len([user for user, years in users.items() if len(years) == n]), 'people sent messages')

1 year(s): 277 people sent messages
2 year(s): 111 people sent messages
3 year(s): 73 people sent messages
4 year(s): 61 people sent messages
5 year(s): 65 people sent messages
6 year(s): 7 people sent messages
7 year(s): 3 people sent messages
8 year(s): 6 people sent messages
9 year(s): 4 people sent messages
10 year(s): 0 people sent messages
11 year(s): 3 people sent messages
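For the extra point, the counts can be drawn as a bar chart. The sketch below is one possible way to do it with `matplotlib`, the package recommended earlier. `count_users_by_active_years` and `plot_counts` are hypothetical helper names, and `years_by_user` is a made-up stand-in for the `users` dictionary built above. `matplotlib` is imported inside the plotting function, so the counting part runs even where it is not installed.

```python
from collections import Counter

# Made-up mapping of sender alias -> set of years with at least one message.
years_by_user = {"A": {2010, 2011}, "B": {2015}, "C": {2012, 2019}, "D": {2020}}

def count_users_by_active_years(years_by_user, max_years):
    """Return [(n, number of users active in exactly n years)] for n = 1..max_years."""
    tally = Counter(len(years) for years in years_by_user.values())
    return [(n, tally.get(n, 0)) for n in range(1, max_years + 1)]

def plot_counts(pairs):
    """Draw the (n, count) pairs as a bar chart and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting

    xs, ys = zip(*pairs)
    fig, ax = plt.subplots()
    ax.bar(xs, ys)
    ax.set_xlabel("number of years with at least one message")
    ax.set_ylabel("number of people")
    ax.set_xticks(xs)
    return fig

pairs = count_users_by_active_years(years_by_user, max_years=3)
print(pairs)  # [(1, 2), (2, 2), (3, 0)]
```

In the notebook, the call would be `plot_counts(count_users_by_active_years(users, 11))`.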


# Who is in the second-largest number of chats? (3 points)

In [464]:
def get_chats(dta):
    chats = {msg['thread_path']: set() for msg in dta}

    for msg in dta:
        users = [msg['sender_name']] + msg['reactions'] + msg['users']
        chats[msg['thread_path']].update(set(users))

    return chats

In [468]:
chats = get_chats(data)
users = set()
for members in chats.values():
    users.update(members)

In [473]:
user_chats = {user: set() for user in users}

In [475]:
for chat, mems in chats.items():
    for mem in mems:
        user_chats[mem].add(chat)

In [478]:
i_user_chats = sorted(user_chats.items(), key = lambda x: x[0])

In [483]:
sorted(i_user_chats, key = lambda x: len(x[1]), reverse = True)[1][0]

'Mary J. Blige'

In [644]:
[x[0] for x in sorted(i_user_chats, key = lambda x: len(x[1]), reverse = True)[:6]]

['Colin Firth',
 'Mary J. Blige',
 'Juliette Lewis',
 'Tilda Swinton',
 'James Coburn',
 'Lucas Hedges']

# Which one is the 2019 DB chat? (Hanga was a DB member then) (5 points)

In [174]:
ATTILA = [msg for msg in data if msg['year'] == 2020 and msg['month'] == 4 and msg['day'] == 18 and msg['hour'] == 19 and msg['minute'] == 3 and msg['content_l'] == 17][0]['sender_name']

In [175]:
ATTILA

'Batman'

In [181]:
[msg for msg in data if msg['year'] == 2020 and msg['month'] == 4 and msg['day'] == 18 and msg['hour'] == 19 and msg['minute'] in [0,1,2,3]]

[{'index': 5,
  'sender_name': 'Broderick Crawford',
  'timestamp_ms': 1587229207660,
  'type': 'Share',
  'thread_path': 494,
  'thread_type': 'RegularGroup',
  'reactions': ['Josephine Hull', 'David Beckham'],
  'sticker': None,
  'ip': None,
  'photos': 0,
  'users': [],
  'gifs': 0,
  'videos': 0,
  'content_l': 78,
  'datetime': '2020-04-18T19:00:07.660Z',
  'year': 2020,
  'month': 4,
  'day': 18,
  'hour': 19,
  'minute': 0},
 {'index': 4,
  'sender_name': 'Batman',
  'timestamp_ms': 1587229397201,
  'type': 'Generic',
  'thread_path': 494,
  'thread_type': 'RegularGroup',
  'reactions': ['Josephine Hull', 'Liza Minnelli'],
  'sticker': None,
  'ip': None,
  'photos': 0,
  'users': [],
  'gifs': 0,
  'videos': 0,
  'content_l': 17,
  'datetime': '2020-04-18T19:03:17.201Z',
  'year': 2020,
  'month': 4,
  'day': 18,
  'hour': 19,
  'minute': 3},
 {'index': 3,
  'sender_name': 'Broderick Crawford',
  'timestamp_ms': 1587229415048,
  'type': 'Generic',
  'thread_path': 494,
  'thre

In [183]:
LAUFER = 'Broderick Crawford'

In [263]:
group_users = {msg['thread_path']: set() for msg in data if msg['thread_type'] == 'RegularGroup'}

In [264]:
for msg in data:
    if msg['thread_type'] == 'RegularGroup':
        group_users[msg['thread_path']].update(msg['users'] + [msg['sender_name']])

In [276]:
PJOT = [msg for msg in data if msg['year'] == 2020 and msg['month'] == 2 and msg['day'] == 1 and msg['hour'] == 22 and msg['minute'] == 9][0]['sender_name']
PJOT

'Katy Perry'

In [277]:
LENTNER = [msg for msg in data if msg['year'] == 2020 and msg['month'] == 2 and msg['day'] == 12 and msg['hour'] == 15 and msg['minute'] == 52][0]['sender_name']
LENTNER

'Christopher Walken'

In [287]:
[(grp, len(users)) for grp, users in group_users.items() if HANGA in users and LENTNER in users and PJOT in users]

[(494, 147),
 (798, 36),
 (797, 13),
 (86, 7),
 (347, 6),
 (785, 12),
 (367, 15),
 (697, 5),
 (670, 12)]

In [297]:
[(msg['users'], msg['datetime']) for msg in data if msg['thread_path'] == 797 and msg['type'] == 'Subscribe']

[(['Don Cheadle', 'Ray Milland'], '2019-10-26T17:12:27.852Z'),
 (['Adriana Barraza', 'Glenn Close', 'Sylvester Stallone'],
  '2019-07-01T22:30:12.387Z'),
 (['Atticus Finch'], '2019-03-13T11:59:35.455Z')]

In [299]:
[(msg['users'], msg['datetime']) for msg in data if msg['thread_path'] == 797 and msg['type'] == 'Unsubscribe']

[(['Atticus Finch'], '2019-10-08T21:58:41.669Z')]

# Who wrote the fewest messages in the 2019 DB chat? (3 points)

In [508]:
DB = 797

In [509]:
db_mems = {msg['sender_name']: [] for msg in data if msg['thread_path'] == DB and msg['type'] == 'Generic'}

In [510]:
for msg in data:
    if msg['thread_path'] == DB and msg['type'] == 'Generic':
        db_mems[msg['sender_name']].append(1)

In [511]:
s_db_mems = sorted(db_mems.items(), key = lambda x: x[0])

In [512]:
min(s_db_mems, key = lambda x: sum(x[1]))[0]

'Don Cheadle'

In [646]:
[x[0] for x in sorted(s_db_mems, key = lambda x: sum(x[1]))[:5]]

['Don Cheadle',
 'Atticus Finch',
 'Ray Milland',
 'Glenn Close',
 'Adriana Barraza']

# Which chat has the longest time between two MESSAGES, and how long is that time? (4 points)

In [648]:
chats = {msg['thread_path']: [] for msg in data if msg['type'] == 'Generic'}

In [649]:
for msg in data:
    if msg['type'] == 'Generic':
        chats[msg['thread_path']].append(msg)

In [650]:
def get_timedelta(inp_ls):
    s_data = sorted(inp_ls, key = lambda x: x['timestamp_ms'])
    max_delta = 0
    for msg_i, msg in enumerate(s_data[:-1]):
        delta = s_data[msg_i + 1]['timestamp_ms'] - msg['timestamp_ms']
        if delta > max_delta:
            max_delta = delta
    return max_delta

In [651]:
i_chats = sorted(chats.items(), key = lambda x: x[0])

In [652]:
max(i_chats, key = lambda x: get_timedelta(x[1]))[0]

865

In [653]:
get_timedelta(chats[865])

145630078210
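The result is in milliseconds, which is hard to read at this size. The sketch below shows a compact, `zip`-based variant of the gap computation (`max_gap_ms` is a hypothetical helper, tested here on made-up timestamps) and converts the value reported above into a `datetime.timedelta`.

```python
from datetime import timedelta

def max_gap_ms(timestamps_ms):
    """Largest difference between consecutive timestamps after sorting (0 if fewer than two)."""
    ordered = sorted(timestamps_ms)
    return max(
        (later - earlier for earlier, later in zip(ordered, ordered[1:])),
        default=0,
    )

# Made-up timestamps, deliberately out of order.
print(max_gap_ms([5000, 1000, 2500]))  # 2500

# The gap reported above for thread 865, as a readable duration.
gap = timedelta(milliseconds=145630078210)
print(gap.days, "days")  # 1685 days
```

1685 days is roughly 4.6 years of silence between two consecutive messages in that thread.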

In [660]:
[(x[0], get_timedelta(chats[x[0]])) for x in sorted(i_chats, key = lambda x: get_timedelta(x[1]), reverse = True)[:5]]

[(865, 145630078210),
 (190, 126225337599),
 (267, 122252204153),
 (618, 115573115781),
 (300, 95963164309)]

# How many people has Hanga talked to in every year since 2015, and how many in every year since 2016 (the year she joined Rajk), and based on this, who on this list are the Rajk members? (4 points)

In [440]:
cut_2015 = [msg for msg in data if msg['year'] >= 2015 and msg['type'] in ['Generic', 'Share']].copy()
cut_2016 = [msg for msg in data if msg['year'] >= 2016 and msg['type'] in ['Generic', 'Share']].copy()

In [456]:
def get_friends(dta):
    users = {msg['sender_name']: set() for msg in dta}
    all_years = set([msg['year'] for msg in dta])
    for msg in dta:
        users[msg['sender_name']].add(msg['year'])
    return set([user[0] for user in users.items() if user[1] == all_years])

friends_15 = get_friends(cut_2015)
friends_16 = get_friends(cut_2016)

In [742]:
len(friends_15)

7

In [743]:
len(friends_16)

56

In [744]:
len(friends_16.difference(friends_15))

49

# Who got the most reactions on average in jeszk-moments? (4 points)

In [745]:
jeszk_msg = [msg for msg in data if msg['thread_path'] == 494 and msg['type'] in ['Generic', 'Share']]

In [746]:
users = {msg['sender_name']: {'react': 0, 'msg': 0} for msg in jeszk_msg}
# key -> value (a new dict of counters for each sender)

In [747]:
for msg in jeszk_msg:
    users[msg['sender_name']]['react'] += len(msg['reactions'])
    users[msg['sender_name']]['msg'] += 1

In [748]:
s_users = sorted(users.items(), key = lambda x: x[0])

In [749]:
max(s_users, key = lambda x: x[1]['react'] / x[1]['msg'])

('Ginger Rogers', {'react': 21, 'msg': 2})

In [750]:
[x[0] for x in sorted(s_users, key = lambda x: x[1]['react'] / x[1]['msg'], reverse = True)[:5]]

['Ginger Rogers',
 'Kevin Kline',
 'Jessica Chastain',
 'Sylvester Stallone',
 'Batman']

# Given a timestamp, who is the author of the longest message written up to that point in time? (1 point)

In [751]:
proba = 1293114321000

In [752]:
def longest_msg(timestamp):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp and msg['type'] == 'Generic']
    users = {msg['sender_name']: 0 for msg in cut}
    for msg in cut:
        if msg['content_l'] > users[msg['sender_name']]:
            users[msg['sender_name']] = msg['content_l']
    s_users = sorted(users.items(), key = lambda x: x[0])
    return max(s_users, key = lambda x: x[1])[0]

longest_msg(proba)

'Colin Firth'

In [753]:
def longest_msg_top5(timestamp):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp and msg['type'] == 'Generic']
    users = {msg['sender_name']: 0 for msg in cut}
    for msg in cut:
        if msg['content_l'] > users[msg['sender_name']]:
            users[msg['sender_name']] = msg['content_l']
    s_users = sorted(users.items(), key = lambda x: x[0])
    return [x[0] for x in sorted(s_users, key = lambda x: x[1], reverse = True)[:5]]

longest_msg_top5(proba)

['Colin Firth', 'Ruby Dee', 'The Tramp ', 'Ronald Colman', 'Barbra Streisand']
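Every function task starts by filtering the whole list on `timestamp_ms <= timestamp`. If many timestamps are queried, one possible optimisation (an assumption about usage, not something the tasks require) is to sort the events once and locate the cut with `bisect`. The sketch below uses made-up events, and `events_until` is a hypothetical helper.

```python
from bisect import bisect_right

# Made-up events, deliberately not in chronological order.
events = [
    {"timestamp_ms": 300, "sender_name": "C"},
    {"timestamp_ms": 100, "sender_name": "A"},
    {"timestamp_ms": 200, "sender_name": "B"},
]

# Sort once, and keep the timestamps in a parallel list for bisecting.
sorted_events = sorted(events, key=lambda ev: ev["timestamp_ms"])
sorted_stamps = [ev["timestamp_ms"] for ev in sorted_events]

def events_until(timestamp):
    """Return all events with timestamp_ms <= timestamp (inclusive)."""
    return sorted_events[:bisect_right(sorted_stamps, timestamp)]

print([ev["sender_name"] for ev in events_until(200)])  # ['A', 'B']
```

`bisect_right` keeps the boundary inclusive, matching the `<=` used in the functions in this notebook.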

# Given a timestamp, which hour is the least active up to that point, among the hours with at least one interaction (i.e. the activity is not 0)? (1 point)

In [754]:
def passziv(timestamp):

    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp]
    hours = {msg['hour']: 0 for msg in cut}
    
    for msg in cut:
        hours[msg['hour']] += 1
    
    #print([h for h in hours.items() if h[1] != 0]) --> already taken care of by the cut
    
    #return min(hours.items(), key = lambda x: x[1]) --> there are several equal ones :(
    
    s_hours = sorted(hours.items(), key = lambda x: x[0])
    
    return min(s_hours, key = lambda x: x[1])[0]

passziv(proba)

9

In [755]:
def passziv_top5(timestamp):

    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp]
    hours = {msg['hour']: 0 for msg in cut}
    
    for msg in cut:
        hours[msg['hour']] += 1
    
    # hours with zero activity never enter the dict, so no extra filtering is needed
    
    # sort by hour first so that ties are resolved in ascending order
    s_hours = sorted(hours.items(), key = lambda x: x[0])
    
    return [x[0] for x in sorted(s_hours, key = lambda x: x[1])[:5]]

passziv_top5(proba)

[9, 23, 12, 22, 14]

# Given a timestamp and a person, tell how many characters they had sent up to that point (2 points)

In [761]:
proba_p = 'Juliette Lewis'

In [762]:
def get_char(timestamp, user):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp and msg['type'] == 'Generic' and msg['sender_name'] == user]
    chars = 0
    for msg in cut:
        chars += msg['content_l']
    return chars    
    
get_char(proba, proba_p)

4252

# Given a timestamp, how many people had written at least 10 messages up to that point? (3 points)

In [763]:
def min_10_msg(timestamp):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp and msg['type'] == 'Generic']
    mems = {msg['sender_name']: 0 for msg in cut}
    for msg in cut:
        mems[msg['sender_name']] += 1
    return len([mem for mem in mems.items() if mem[1] >= 10])

min_10_msg(proba)

2

# Given a timestamp, which were the top 5 most active chats up to that point? (4 points)

In [764]:
def active_chat(timestamp):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp]
    chats = {msg['thread_path']: 0 for msg in cut}
    for msg in cut:
        chats[msg['thread_path']] += 1
    
    s_active = sorted(list(chats.items()), key = lambda x: x[0])
    return [x[0] for x in sorted(s_active, key = lambda x: x[1], reverse = True)[:5]]
    
active_chat(proba)

[32, 237, 622, 360, 546]

# Given a timestamp and a chat (thread_id), how many different people wrote in it, and how many messages and how many characters did they write in total up to that point? (4 points)

In [765]:
def chat_details(timestamp, thread_id):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp and msg['thread_path'] == thread_id and msg['type'] == 'Generic']
    mems = len(set([msg['sender_name'] for msg in cut]))
    chars = sum([msg['content_l'] for msg in cut])
    return mems, len(cut), chars

chat_details(proba, 237)

(5, 17, 930)

For +2 points: the number of messages and characters sent, grouped by person

In [766]:
def chat_details2(timestamp, thread_id):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp and msg['thread_path'] == thread_id and msg['type'] == 'Generic']
    mems = {msg['sender_name']: {'msg': 0, 'char': 0} for msg in cut}
    for msg in cut:
        mems[msg['sender_name']]['msg'] += 1
        mems[msg['sender_name']]['char'] += msg['content_l']
    return mems

chat_details2(proba, 237)

{'Colin Firth': {'msg': 3, 'char': 126},
 'Juliette Lewis': {'msg': 7, 'char': 417},
 'Barbra Streisand': {'msg': 3, 'char': 274},
 'U2': {'msg': 3, 'char': 63},
 'Ruby Dee': {'msg': 1, 'char': 50}}

# Given an hour, how many messages were sent in that hour on average, averaged over all the days on which messages were sent? (5 points)

In [767]:
def mean_msg(hour):
    cut = [msg for msg in data if msg['hour'] == hour and msg['type'] == 'Generic']
    days = set([msg['datetime'][:10] for msg in cut])
    return len(cut)/len(days)
    
mean_msg(17)

13.714360587002096
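
The cell above identifies a day by slicing the first ten characters of `datetime`, which assumes that field is a string starting with the date. A sketch that avoids this assumption builds the day key from the `year`, `month`, and `day` fields instead, and returns `0.0` when no message matches rather than dividing by zero. The function name and sample data are hypothetical.

```python
def mean_messages_in_hour(messages, hour):
    """Average number of messages sent in `hour`, over the days that had any."""
    cut = [m for m in messages if m['hour'] == hour and m['type'] == 'Generic']
    # A (year, month, day) tuple identifies a calendar day without string slicing
    days = {(m['year'], m['month'], m['day']) for m in cut}
    return len(cut) / len(days) if days else 0.0

# Hypothetical sample data: three messages on one day, one on another
sample = [
    {'year': 2019, 'month': 1, 'day': 1, 'hour': 17, 'type': 'Generic'},
    {'year': 2019, 'month': 1, 'day': 1, 'hour': 17, 'type': 'Generic'},
    {'year': 2019, 'month': 1, 'day': 1, 'hour': 17, 'type': 'Generic'},
    {'year': 2019, 'month': 1, 'day': 2, 'hour': 17, 'type': 'Generic'},
    {'year': 2019, 'month': 1, 'day': 2, 'hour': 18, 'type': 'Generic'},
]
print(mean_messages_in_hour(sample, 17))
```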

# Given a timestamp, who wrote the most characters up to that point in time, and in which chat? (5 points)

In [768]:
def longest_char(timestamp):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp and msg['type'] == 'Generic']
    chats = {msg['thread_path']: 0 for msg in cut}
    users = {msg['sender_name']: chats.copy() for msg in cut}
    
    for msg in cut:
        users[msg['sender_name']][msg['thread_path']] += msg['content_l']
        
    s_winner = sorted(users.items(), key = lambda x: x[0])
    winner = max(s_winner, key = lambda x: max(x[1].values()))[0]
    
    s_chat = sorted(users[winner].items(), key = lambda x: x[0])
    chat = max(s_chat, key = lambda x: x[1])[0]
    return winner, chat

longest_char(proba)

('Colin Firth', 622)
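
The nested user-by-chat dictionary can also be flattened into a single counter keyed by `(sender, thread)` pairs. This is a sketch under assumptions, not the original approach: `top_writer`, `messages`, and the sample list are made-up names, and ties are resolved by sorted key order.

```python
from collections import Counter

def top_writer(messages, timestamp):
    """Return the (sender, thread) pair with the most characters up to `timestamp`."""
    totals = Counter()
    for m in messages:
        if m['timestamp_ms'] <= timestamp and m['type'] == 'Generic':
            totals[(m['sender_name'], m['thread_path'])] += m['content_l']
    if not totals:
        return None
    # Sorting first makes the winner deterministic when totals are tied
    (sender, thread), _ = max(sorted(totals.items()), key=lambda kv: kv[1])
    return sender, thread

# Hypothetical sample data
sample = [
    {'sender_name': 'A', 'thread_path': 1, 'timestamp_ms': 1, 'type': 'Generic', 'content_l': 10.0},
    {'sender_name': 'A', 'thread_path': 1, 'timestamp_ms': 2, 'type': 'Generic', 'content_l': 5.0},
    {'sender_name': 'B', 'thread_path': 2, 'timestamp_ms': 3, 'type': 'Generic', 'content_l': 12.0},
    {'sender_name': 'A', 'thread_path': 2, 'timestamp_ms': 4, 'type': 'Generic', 'content_l': 3.0},
]
print(top_writer(sample, 10))
```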

# Given a timestamp, what was the longest period without a message up to that point in time? (7 points)

In [771]:
def longest_without_msg(timestamp):
    cut = [msg for msg in data if msg['timestamp_ms'] <= timestamp and msg['type'] == 'Generic']
    sort_cut = sorted(cut, key = lambda x: x['timestamp_ms'])
    
    # Track the largest gap (in milliseconds) between consecutive messages
    max_t = 0
    for msg_i, msg in enumerate(sort_cut[:-1]):
        t = sort_cut[msg_i + 1]['timestamp_ms'] - msg['timestamp_ms']
        if t > max_t:
            max_t = t
    return max_t

longest_without_msg(proba)
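
The longest-gap computation asked for in this exercise can be sketched by pairing each sorted timestamp with its successor via `zip`. The names `longest_gap_ms`, `messages`, and `sample` are hypothetical; the result is in milliseconds, like `timestamp_ms`, and `0` is returned when fewer than two messages qualify.

```python
def longest_gap_ms(messages, timestamp):
    """Largest gap in ms between consecutive Generic messages up to `timestamp`."""
    times = sorted(
        m['timestamp_ms'] for m in messages
        if m['timestamp_ms'] <= timestamp and m['type'] == 'Generic'
    )
    # zip(times, times[1:]) yields every pair of neighbouring timestamps
    return max((b - a for a, b in zip(times, times[1:])), default=0)

# Hypothetical sample data; the last message is after the cut-off below
sample = [
    {'timestamp_ms': 400, 'type': 'Generic'},
    {'timestamp_ms': 100, 'type': 'Generic'},
    {'timestamp_ms': 450, 'type': 'Generic'},
    {'timestamp_ms': 2000, 'type': 'Generic'},
]
print(longest_gap_ms(sample, 1000))
```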

# Who did Hanga message first? (1 point)

In [673]:
hanga_msg = [msg for msg in data if msg['sender_name'] == 'Colin Firth']

In [674]:
time_msg = min(hanga_msg, key = lambda x: x['timestamp_ms'])

In [675]:
print(time_msg['thread_path'], time_msg['timestamp_ms'])

538 1281087355000
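
Since `timestamp_ms` counts milliseconds since 1 January 1970, the value printed above can be turned into a readable date with the standard `datetime` module. The helper name `ms_to_utc` is an assumption, and the conversion is done in UTC; the dataset's own `hour` and `day` fields may have been derived in a different time zone.

```python
from datetime import datetime, timezone

def ms_to_utc(ts_ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)

first = ms_to_utc(1281087355000)
print(first.isoformat())
```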


In [676]:
data_cut = [msg for msg in data if time_msg['timestamp_ms'] < msg['timestamp_ms'] and msg['thread_path'] == time_msg['thread_path'] and msg['sender_name'] != 'Colin Firth']

In [677]:
sorted(data_cut, key = lambda x: x['timestamp_ms'])[0]['sender_name']

'Judy Sheindlin'

# What is Fenyő's alias? (3 points)


In [195]:
FENYŐ = [msg for msg in data if msg['year'] == 2019 and msg['month'] == 11 and msg['day'] == 12 and msg['hour'] == 17 and msg['minute'] == 58][0]['sender_name']

In [196]:
FENYŐ

'Leonardo DiCaprio'

# What is the index of Hanga's year-group chat? (2 points)


In [198]:
cut = [msg for msg in data if msg['year'] == 2016 and msg['thread_type'] == 'RegularGroup']

In [199]:
groups = {msg['thread_path']: set() for msg in cut}

In [202]:
for msg in cut:
    groups[msg['thread_path']].update([msg['sender_name']] + msg['reactions'] + msg['users'])

In [203]:
groups

{566: {'Bill Murray',
  'Clint Eastwood',
  'Dianne Wiest',
  'George Chakiris',
  'George Clooney',
  'Haley Joel Osment',
  'Linda Hunt',
  'Peter Finch',
  'Richard Dreyfuss',
  'Sandra Bullock'},
 858: {'Colin Firth', 'Juliette Lewis', 'U2'},
 558: {'Billy Bob Thornton',
  'Colin Firth',
  'Floyd Mayweather, Jr.',
  'Gale Sondergaard',
  'George Burns',
  'Hattie McDaniel',
  'James Coburn',
  'Jared Leto',
  'Jennifer Hudson',
  'Joan Fontaine',
  'Judy Holliday',
  'Kate Winslet',
  'Katy Perry',
  'King Kong',
  'Leonardo DiCaprio',
  'Lucas Hedges',
  'Mary J. Blige',
  'Michael Douglas',
  'Shohreh Aghdashloo',
  'Steve Carell',
  'Tilda Swinton'},
 144: {'Casey Affleck',
  'Colin Firth',
  'John Malkovich',
  'Miranda Richardson',
  'Miyoshi Umeki'},
 862: {'Alec Guinness',
  'Bill Murray',
  'Bradley Cooper',
  'Brenda Fricker',
  'Catalina Sandino Moreno',
  'Colin Firth',
  'Cynthia Erivo',
  'Darth Vader ',
  'Debra Winger',
  'Diane Ladd',
  'Dr. Phil McGraw',
  'Emil Ja

In [211]:
[g for g in groups.items() if 'Katy Perry' in g[1] and 'Leonardo DiCaprio' in g[1]]

[(558,
  {'Billy Bob Thornton',
   'Colin Firth',
   'Floyd Mayweather, Jr.',
   'Gale Sondergaard',
   'George Burns',
   'Hattie McDaniel',
   'James Coburn',
   'Jared Leto',
   'Jennifer Hudson',
   'Joan Fontaine',
   'Judy Holliday',
   'Kate Winslet',
   'Katy Perry',
   'King Kong',
   'Leonardo DiCaprio',
   'Lucas Hedges',
   'Mary J. Blige',
   'Michael Douglas',
   'Shohreh Aghdashloo',
   'Steve Carell',
   'Tilda Swinton'}),
 (838,
  {'Billy Bob Thornton',
   'Colin Firth',
   'Ed Harris',
   'Floyd Mayweather, Jr.',
   'Gale Sondergaard',
   'George Burns',
   'Hattie McDaniel',
   'James Coburn',
   'Jared Leto',
   'Jennifer Hudson',
   'Joan Fontaine',
   'Judy Holliday',
   'Kate Winslet',
   'Katy Perry',
   'King Kong',
   'Leonardo DiCaprio',
   'Lucas Hedges',
   'Mary J. Blige',
   'Michael Douglas',
   'Shohreh Aghdashloo',
   'Steve Carell',
   'Tilda Swinton'})]

In [212]:
len([msg for msg in data if msg['thread_path'] == 558])

13864

# Youngest SZMT member

In [614]:
LAUFER

'Broderick Crawford'

# In which year did Hanga become a member of the most groups? (2 points?)

In [774]:
cut = [msg['year'] for msg in data if msg['type'] == 'Subscribe' and HANGA in msg['users']]

In [775]:
years = {y: 0 for y in cut}

In [776]:
for y in cut:
    years[y] += 1

In [777]:
i_years = sorted(years.items(), key = lambda x: x[0])

In [778]:
max(i_years, key = lambda x: x[1])[0]

2018

In [779]:
[x for x in sorted(i_years, key = lambda x: x[1], reverse = True)[:5]]

[(2018, 10), (2019, 9), (2017, 6), (2015, 5), (2016, 3)]
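
The count-then-rank steps above could be condensed with `collections.Counter` and its `most_common` method. This is a sketch with hypothetical names (`joins_per_year`, `messages`, `member`, `sample`); note that `most_common` keeps insertion order for equal counts, which differs slightly from sorting by year first.

```python
from collections import Counter

def joins_per_year(messages, member):
    """Count Subscribe events per year in which `member` was added to a chat."""
    return Counter(
        m['year'] for m in messages
        if m['type'] == 'Subscribe' and member in m['users']
    )

# Hypothetical sample data
sample = [
    {'year': 2017, 'type': 'Subscribe', 'users': ['X']},
    {'year': 2018, 'type': 'Subscribe', 'users': ['X', 'Y']},
    {'year': 2018, 'type': 'Subscribe', 'users': ['X']},
    {'year': 2018, 'type': 'Subscribe', 'users': ['Y']},
    {'year': 2019, 'type': 'Generic', 'users': []},
]
per_year = joins_per_year(sample, 'X')
print(per_year.most_common(5))
```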

# How long is the message that received the most reactions? (3 points)

In [589]:
max(data, key = lambda x: len(x['reactions']))['content_l']

109

In [694]:
max([msg for msg in data if msg['type'] == 'Generic'], key = lambda x: len(x['reactions']))['content_l']

109

In [692]:
[x['content_l'] for x in sorted(data, key = lambda x: len(x['reactions']), reverse = True)[:5]]

[109, 0, 26, 0, 0]

In [697]:
[x['content_l'] for x in sorted([msg for msg in data if msg['type'] == 'Generic'], key = lambda x: len(x['reactions']), reverse = True)[:5]]

[109, 0, 26, 0, 0]

# In which group chat were the most messages sent relative to the number of participants (total number of messages / number of unique people who were ever members of the group)? (5 points)

In [699]:
chats = get_chats([msg for msg in data if msg['thread_type'] == 'RegularGroup'])

In [700]:
for chat in chats:
    chats[chat].add(HANGA)

In [701]:
for chat in chats:
    chats[chat] = {'mems': len(chats[chat]), 'msgs': 0}

In [702]:
for msg in [msg for msg in data if msg['type'] == 'Generic' and msg['thread_type'] == 'RegularGroup']:
    chats[msg['thread_path']]['msgs'] += 1

In [703]:
s_chats = sorted(chats.items(), key = lambda x: x[0])

In [704]:
max(s_chats, key = lambda x: x[1]['msgs']/x[1]['mems'])[0]

858

In [706]:
[x[0] for x in sorted(s_chats, key = lambda x: x[1]['msgs']/x[1]['mems'], reverse = True)[:5]]

[858, 454, 857, 356, 671]
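
The `get_chats` helper used above is not defined in this part of the notebook. A minimal sketch of what such a helper might do, assuming it gathers senders, reactors, and `users` per thread as in the year-group exercise, is shown below together with the ratio computation. `collect_members`, `busiest_per_member`, the `owner` parameter, and the sample data are all hypothetical, and `reactions` and `users` are assumed to be lists of names.

```python
def collect_members(messages):
    """Map each thread id to the set of names seen as sender, reactor, or user."""
    members = {}
    for m in messages:
        names = members.setdefault(m['thread_path'], set())
        names.update([m['sender_name']], m.get('reactions', []), m.get('users', []))
    return members

def busiest_per_member(messages, owner=None):
    """Thread id of the group chat with the highest messages-per-member ratio."""
    group = [m for m in messages if m['thread_type'] == 'RegularGroup']
    members = collect_members(group)
    if owner is not None:
        for names in members.values():
            names.add(owner)  # the account owner is a member of every chat
    counts = {}
    for m in group:
        if m['type'] == 'Generic':
            counts[m['thread_path']] = counts.get(m['thread_path'], 0) + 1
    # Iterating over sorted thread ids keeps ties deterministic
    return max(sorted(members), key=lambda t: counts.get(t, 0) / len(members[t]))

# Hypothetical sample data
sample = [
    {'thread_path': 1, 'thread_type': 'RegularGroup', 'type': 'Generic', 'sender_name': 'A', 'reactions': [], 'users': []},
    {'thread_path': 1, 'thread_type': 'RegularGroup', 'type': 'Generic', 'sender_name': 'B', 'reactions': ['A'], 'users': []},
    {'thread_path': 1, 'thread_type': 'RegularGroup', 'type': 'Generic', 'sender_name': 'A', 'reactions': [], 'users': []},
    {'thread_path': 2, 'thread_type': 'RegularGroup', 'type': 'Subscribe', 'sender_name': 'C', 'reactions': [], 'users': ['D', 'E']},
    {'thread_path': 2, 'thread_type': 'RegularGroup', 'type': 'Generic', 'sender_name': 'C', 'reactions': [], 'users': []},
    {'thread_path': 3, 'thread_type': 'Regular', 'type': 'Generic', 'sender_name': 'F', 'reactions': [], 'users': []},
]
print(busiest_per_member(sample))
```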