# Data Preparation

- **INPUT**: a .csv file with the message text in a single column.
- **OUTPUT**: several chunked .csv files with the unique message text in the first column (duplicates removed).
    - Pass the output to 2_classify.py to retrieve toxicity scores via the Perspective API.
    - **NB**: the Perspective API takes raw text (UTF-8), including emojis; no pre-processing is necessary.
        - See: https://developers.perspectiveapi.com/
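
To illustrate the point about raw text, here is a minimal sketch of how one unprocessed message could be wrapped in a Perspective API request body. The helper name `build_request`, the `"nl"` language code (chosen because the sample messages are Dutch), and the single `TOXICITY` attribute are assumptions for illustration; this is not taken from 2_classify.py, and no request is actually sent.

```python
import json

# Assumed endpoint of the Perspective API's comments:analyze method;
# an API key would be appended as ?key=... when sending the request.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text, language="nl"):
    """Hypothetical helper: wrap one raw message in an analyze request body."""
    return {
        "comment": {"text": text},                # raw text, emojis included
        "languages": [language],                  # assumption: Dutch messages
        "requestedAttributes": {"TOXICITY": {}},  # assumption: toxicity only
    }


# The message goes in as-is; ensure_ascii=False keeps emojis readable as UTF-8.
body = json.dumps(build_request("❗ Wat een aparte namen"), ensure_ascii=False)
print(body)
```

The message text is passed through untouched, which is why the chunks exported by this notebook keep the original `text` column rather than one of the cleaned variants.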

## setup

In [1]:
#dependencies:
import pandas as pd

#data directory:
datadir = '/Users/rptkiddle/Desktop/Network-Toxicity/data/'

## load source dataset (cleaned_data.csv)

In [2]:
#load data:
data = pd.read_csv(datadir+'cleaned_data.csv')
data.head()

Unnamed: 0,date,url,source,type,user_data,ids,message_id,text_clean,links,link_frequency,text,text_lower,nolink_text,noemoji_text,spaced_text
0,2021-06-18 04:55:38+00:00,https://t.me/fvd_nl,fvd_nl,chat,"User(id=1036480136, is_self=False, contact=Fal...",1036480000.0,15706.0,wat een aparte namen hebben die zieke kinderen...,,,Wat een aparte namen hebben die zieke kinderen...,wat een aparte namen hebben die zieke kinderen...,wat een aparte namen hebben die zieke kinderen...,wat een aparte namen hebben die zieke kinderen...,wat een aparte namen hebben die zieke kinderen...
1,2021-06-18 01:05:45+00:00,https://t.me/fvd_nl,fvd_nl,chat,"User(id=1396449004, is_self=False, contact=Fal...",1396449000.0,15704.0,jufmaikenl doet aan kindermishandeling via psy...,,,Jufmaike.nl doet aan kindermishandeling via ps...,jufmaike.nl doet aan kindermishandeling via ps...,jufmaike.nl doet aan kindermishandeling via ps...,jufmaike.nl doet aan kindermishandeling via ps...,jufmaike.nl doet aan kindermishandeling via ps...
2,2021-06-17 18:22:11+00:00,https://t.me/fvd_nl,fvd_nl,chat,"User(id=1808984447, is_self=False, contact=Fal...",1808984000.0,15702.0,,,,,,,,
3,2021-06-17 17:38:00+00:00,https://t.me/fvd_nl,fvd_nl,chat,"User(id=1697434758, is_self=False, contact=Fal...",1697435000.0,15701.0,,,,,,,,
4,2021-06-17 13:57:04+00:00,https://t.me/fvd_nl,fvd_nl,chat,"User(id=300237411, is_self=False, contact=Fals...",300237400.0,15700.0,red_exclamation_mark red_exclamation_mark red_...,,,❗❗❗❗❗❗❗❗❗❗\nORALE VACCINATIES DOOR HET DRINKWA...,❗❗❗❗❗❗❗❗❗❗\norale vaccinaties door het drinkwa...,❗❗❗❗❗❗❗❗❗❗\norale vaccinaties door het drinkwa...,:red_exclamation_mark::red_exclamation_mark::r...,:red_exclamation_mark ::red_exclamation_mark :...


## remove duplicate messages

In [3]:
#take messages column (series) into list:
messages = data['text'].tolist()

#drop duplicates (NB: set() does not preserve the original message order):
nodupes = list(set(messages))

#check reduction:
print(f"reduced {len(messages)} messages to {len(nodupes)} unique messages after removing duplicates.")

reduced 2033663 messages to 1321197 unique messages after removing duplicates.
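
A possible alternative, sketched here on toy data: `set()` does not keep the original order, and it does not filter out missing values, which `data.head()` shows are present in the `text` column. Using pandas' own `dropna()` and `drop_duplicates()` would keep the first occurrence of each message in its original order and skip empty rows. The toy series below is a stand-in for `data['text']`; on the real data the counts would likely differ slightly from those printed above.

```python
import pandas as pd

# Toy stand-in for data['text']; None marks rows without message text.
text = pd.Series(["hallo", None, "hallo", "❗ test", None, "doei"])

# Drop missing values first, then keep the first occurrence of each message.
unique_text = text.dropna().drop_duplicates().tolist()

print(f"reduced {len(text)} messages to {len(unique_text)} unique messages.")
print(unique_text)
```

Because the order is deterministic, re-running the notebook would then produce identical chunk files, which makes it easier to resume an interrupted classification run.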


## chunk for feeding to perspective api

In [4]:
#split messages into chunks of 100k:
chunked = [nodupes[i:i + 100000] for i in range(0, len(nodupes), 100000)]

#..and export each chunk to CSV (chunk1.csv, chunk2.csv, ...):
for n, chunk in enumerate(chunked, start=1):
    pd.DataFrame(chunk).to_csv(f"{datadir}chunk{n}.csv", index=False, header=False)
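
As a sanity check, the exported chunks can be read back and compared with the original list. The sketch below does this on toy data in a temporary directory; the message list, the chunk size of 2 (instead of 100000) and the temporary directory are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy messages, including an embedded newline and an emoji.
messages = ["eerste bericht", "tweede\nbericht", "❗ derde", "vierde", "vijfde"]
chunk_size = 2  # the notebook uses 100000

with tempfile.TemporaryDirectory() as tmp:
    # Export: same naming scheme as above (chunk1.csv, chunk2.csv, ...).
    for n, start in enumerate(range(0, len(messages), chunk_size), start=1):
        chunk = messages[start:start + chunk_size]
        pd.DataFrame(chunk).to_csv(Path(tmp) / f"chunk{n}.csv", index=False, header=False)

    # Read back. Plain sorting is fine for fewer than ten files; with more,
    # sort on the numeric suffix, since "chunk10" sorts before "chunk2".
    paths = sorted(Path(tmp).glob("chunk*.csv"))
    restored = []
    for path in paths:
        # header=None: the files have no header row, so the column is named 0.
        restored.extend(pd.read_csv(path, header=None)[0].tolist())

print(len(paths), restored == messages)
```

Messages containing line breaks are quoted by `to_csv`, so they survive the round trip as single rows as long as they are read back with a CSV parser rather than line by line.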