# Data cleanup
In this notebook, we clean up the raw Tweet data and split it into separate English and Korean datasets.

In [2]:
import pandas as pd

df = pd.read_csv('../data/data.csv', index_col=0)

# Let's see what the raw data looks like. 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2295 entries, 1529697716368211968 to 1532281260626055168
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        2118 non-null   object
 1   created_at  2295 non-null   object
 2   lang        2295 non-null   object
dtypes: object(3)
memory usage: 71.7+ KB


Only 2118 of the 2295 Tweets have non-null text, which means 177 Tweets are empty. Clearly some of this data is invalid and needs to be cleaned up. There are a few cases we want to account for.

These are:
- Duplicates
- Tweets that are neither English nor Korean
- Empty Tweets
- Spam or advertisements

In [4]:
# Drop all accidental copies
df = df.drop_duplicates()

# Drop all Tweets with the exact same message...
# These could be retweets, spam, etc...
df = df.drop_duplicates(subset=['text'])

# Drop empty Tweets (missing text) explicitly
df = df.dropna(subset=['text'])

# Drop all Tweets with the keyword 'wts'
# These are advertisements selling K-pop merchandise
df = df[~df['text'].str.contains('wts')]

# Some of the data was gathered using methods that left artifacts behind.
# Examples are broken link strings (always starting with 'https') and retweet prefixes.
# (?:...) is a non-capturing group.
patterns = [r'(?:https\w+)', r'(?:RT)( )(\w+)']
for pattern in patterns:
    df['text'] = df['text'].str.replace(pattern, '', regex=True)

# Next, we only want English and Korean tweets...
langs = ['en', 'ko']
df = df[df['lang'].isin(langs)]
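
To make the effect of these steps easier to follow, here is a minimal sketch that runs them on a made-up four-row frame. The rows, the name `toy`, and the simplified patterns `https\w+` and `RT \w+` are assumptions for illustration, not taken from the dataset. Empty Tweets are dropped explicitly with `dropna` so that the result does not depend on how missing values behave inside `str.contains`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "text": [
            "wts photocard set",
            "RT someuser new album httpstcoAbC123",
            None,
            "great stage",
        ],
        "lang": ["en", "en", "ko", "fr"],
    }
)

# Drop empty Tweets, then advertisements containing 'wts'.
kept = toy.dropna(subset=["text"])
kept = kept[~kept["text"].str.contains("wts")].copy()

# Strip leftover link and retweet artifacts (simplified patterns).
for pattern in [r"https\w+", r"RT \w+"]:
    kept["text"] = kept["text"].str.replace(pattern, "", regex=True)

# Keep only English and Korean rows.
kept = kept[kept["lang"].isin(["en", "ko"])]
print(kept)
```

Only the second row survives: the first is an advertisement, the third has no text, and the fourth is in neither target language.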

In [5]:
# Let's see what data we have left...
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1566 entries, 1529697716368211968 to 1532281272986664960
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        1566 non-null   object
 1   created_at  1566 non-null   object
 2   lang        1566 non-null   object
dtypes: object(3)
memory usage: 48.9+ KB


So we went from 2295 entries to 1566 entries, which is a decrease of about 31.8%.
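
The relative decrease follows directly from the two row counts reported by `df.info()`. A quick check, with the variable names chosen here for illustration:

```python
# Entry counts reported by the two df.info() calls.
before, after = 2295, 1566

# Relative decrease, as a percentage of the original row count.
decrease_pct = (before - after) / before * 100
print(f"{decrease_pct:.1f}% decrease")
```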

In [6]:
# Print value counts for the 'lang' column
df['lang'].value_counts()

en    1040
ko     526
Name: lang, dtype: int64

As we can see, the English dataset (1040 Tweets) is roughly twice the size of the Korean dataset (526 Tweets). If this data is used for machine learning, consider undersampling the English dataset to balance the two languages.
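
One possible way to undersample is to draw the same number of rows from each language group. The sketch below does this on a small made-up frame; `demo`, `n_min`, and the seed are assumptions for illustration, and `groupby(...).sample` requires pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame: 4 English rows, 2 Korean rows.
demo = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d", "가", "나"],
        "lang": ["en", "en", "en", "en", "ko", "ko"],
    }
)

# Size of the smallest language group.
n_min = int(demo["lang"].value_counts().min())

# Randomly keep n_min rows per language; random_state is an arbitrary seed.
balanced = demo.groupby("lang").sample(n=n_min, random_state=0)
print(balanced["lang"].value_counts())
```

Random undersampling discards data, so whether it is appropriate depends on the downstream model and on how much English data can be spared.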

Next, the Korean dataset needs some additional cleanup.

In [10]:
# Save the cleaned data to a separate CSV file
# so that the raw data.csv is left untouched.
df.to_csv('../data/data_clean.csv')

In [7]:
# Split the dataset into Korean and English dataframes
df_ko = df[df['lang'] == 'ko']
df_en = df[df['lang'] == 'en']

In [8]:
df_ko

id,text,created_at,lang
1529697716368211968,에버랜드 X HYBE 가든 오브 라이츠 투바투 개쩔어유 어머뿔자 영원럽 G...,2022-05-26 05:35:30+00:00,ko
1529696380230696960,김가람탈퇴해 hybe가 김가람과 괴롭힘을 당한 사람을 화해시키는 것을 의논하고 있...,2022-05-26 05:30:12+00:00,ko
1529696274802683904,ʚ 𝐘𝐨𝐮𝐫 𝐓𝐰𝐞𝐧𝐭𝐲 ɞ 김선우 생일 응원 프로젝트 하이브 앞 버스 정...,2022-05-26 05:29:46+00:00,ko
1529697908945133568,플리캠 Simply KPop Behind Clip 아니 진짜로 리터럴리 천사...,2022-05-26 05:36:16+00:00,ko
1529697893870817280,라벤더 베레모 채원이 김채원 KIMCHAEWON チェウォン 르세라핌 LESSER...,2022-05-26 05:36:12+00:00,ko
...,...,...,...
1532562810760495106,잘생김을 넘어선 아름다움 TAEHYUNG YearsWithV 태형아사랑해 ...,2022-06-03 03:20:22+00:00,ko
1532561725983436800,두리뭉실 넘어갈 생각마라 정바비참여곡삭제해 garamOUT 하이브피해자에게사과해,2022-06-03 03:16:03+00:00,ko
1532561322768203776,최애는 내가 선택하는 것이 아니라 최애가 내게 강림하는 거라면서요 늘 노력하고 진...,2022-06-03 03:14:27+00:00,ko
1532339411186491392,멤버들이 백악관에서 한 연설이 그들의 청춘에 대한 선한 영향력이 지속될 수 있도록...,2022-06-02 12:32:39+00:00,ko


In [9]:
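# Strip, in order: punctuation/symbols, digits, all whitespace, and HTML-like tags.
# Removing all whitespace joins the words together, so the last two patterns
# (leading/trailing whitespace) have nothing left to match.
patterns = [r'[@%\\*=()/~#&\+á?\xc3\xa1\-\|\.\:\;\!\-\,\_\~\$\'\"]', r'\d+', r'\s+', r'<[^>]+>', r"^\s+", r'\s+$']
for pattern in patterns:
    df_ko['text'] = df_ko['text'].str.replace(pattern, '', regex=True)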

df_ko

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


id,text,created_at,lang
1529697716368211968,에버랜드XHYBE가든오브라이츠투바투개쩔어유어머뿔자영원럽GBGB순으로하는데포시즌스가든...,2022-05-26 05:35:30+00:00,ko
1529696380230696960,김가람탈퇴해hybe가김가람과괴롭힘을당한사람을화해시키는것을의논하고있다는데그게바로횡포실...,2022-05-26 05:30:12+00:00,ko
1529696274802683904,ʚ𝐘𝐨𝐮𝐫𝐓𝐰𝐞𝐧𝐭𝐲ɞ김선우생일응원프로젝트하이브앞버스정류장광고한강대교북단LG유플러스...,2022-05-26 05:29:46+00:00,ko
1529697908945133568,플리캠SimplyKPopBehindClip아니진짜로리터럴리천사잖아요இ௰இ♡KROUN...,2022-05-26 05:36:16+00:00,ko
1529697893870817280,라벤더베레모채원이김채원KIMCHAEWONチェウォン르세라핌LESSERAFIM,2022-05-26 05:36:12+00:00,ko
...,...,...,...
1532562810760495106,잘생김을넘어선아름다움TAEHYUNGYearsWithV태형아사랑해태형아지켜줄게정바비참...,2022-06-03 03:20:22+00:00,ko
1532561725983436800,두리뭉실넘어갈생각마라정바비참여곡삭제해garamOUT하이브피해자에게사과해,2022-06-03 03:16:03+00:00,ko
1532561322768203776,최애는내가선택하는것이아니라최애가내게강림하는거라면서요늘노력하고진실된모습보여줘서뿌듯한덕...,2022-06-03 03:14:27+00:00,ko
1532339411186491392,멤버들이백악관에서한연설이그들의청춘에대한선한영향력이지속될수있도록소속사인하이브의학폭과성...,2022-06-02 12:32:39+00:00,ko
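
The "value is trying to be set on a copy of a slice" warning in the output above appears because `df_ko` was created by filtering `df`, and a column of that filtered result is then overwritten. A common way to avoid it is to take an explicit copy when splitting. The sketch below shows this on a made-up frame; `demo`, `demo_ko`, and the shortened punctuation class are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data.
demo = pd.DataFrame(
    {"text": ["하이브 앞 버스 2022 #HYBE!", "good morning"], "lang": ["ko", "en"]}
)

# .copy() gives an independent frame, so the assignment below
# modifies demo_ko itself rather than a slice of demo.
demo_ko = demo[demo["lang"] == "ko"].copy()

# Two of the notebook's patterns (digits, whitespace) plus a shortened,
# illustrative punctuation class.
for pattern in [r"[#!]", r"\d+", r"\s+"]:
    demo_ko["text"] = demo_ko["text"].str.replace(pattern, "", regex=True)

print(demo_ko["text"].iloc[0])
```

The original `demo` frame is left unchanged, which keeps the split non-destructive.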


In [11]:
# Save them to CSV files in the 'data' directory
df_ko.to_csv('../data/data_ko.csv')
df_en.to_csv('../data/data_en.csv')