# Cleanup Process

There are two main schools of thought on cleaning data for machine learning projects:

1. Clean up the entire "dataset", then split it into train/test sets and proceed with preprocessing for ML.
2. Split the raw data into train/test sets, clean up only the training set, and save the cleanup process.

The second is generally preferred: because the cleanup process is saved as a step in an analytic pipeline, it can be deployed in production and applied to new data. (In production there is no complete dataset, hence the quotes.)

That said, there is a large number of duplicated Tweets (mostly retweets), and because I'm not implementing an online system, I'll use a hybrid approach: I will remove duplicates prior to splitting into train/test sets, but will save any other cleanup for the analytic pipeline.
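A rough sketch of the hybrid approach described above: drop duplicated text first, then split. The toy frame, the 80/20 ratio, and the `random_state` value are illustrative assumptions, not values used in this project.

```python
import pandas as pd

# Toy stand-in for the raw tweets (hypothetical data, not from the real dataset)
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Text": ["a", "b", "a", "c", "d", "e"],
})

# Step 1: remove duplicated text BEFORE splitting, so the same tweet text
# cannot end up in both the training set and the test set
deduped = raw.drop_duplicates(subset="Text", keep="first").reset_index(drop=True)

# Step 2: split; the 80/20 ratio and the seed are assumptions
train = deduped.sample(frac=0.8, random_state=42)
test = deduped.drop(train.index)
```

Any further cleanup would then be fitted on `train` only and reused unchanged on `test`.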

This notebook is not the process itself; it is a presentation of the process and an exploration of the data. I capture the variable "Retweet" here, which is essentially feature engineering (a step further down the pipeline), so that I can understand the percentage of retweets in both the duplicated and the deduplicated Tweets.


In [1]:
import os
import re
import json
import time

import string
import datetime
import urlextract
import pandas as pd

from html import unescape
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [2]:
def load_todays_data():
    filepath = os.path.join("..","data","1_raw","tweets")
    today_prefix = datetime.datetime.now().strftime("%Y%m%d")
    dfm = []
    for f in os.listdir(filepath):
        if re.match(today_prefix, f):
            dfm.append(pd.read_csv(os.path.join(filepath, f)))
    df = pd.concat(dfm)
    df = df.reset_index(drop=True)
    return df

In [3]:
#df = load_todays_data()

In [4]:
def load_all_data():
    # read every file in the raw tweets folder and stack them into one frame
    filepath = os.path.join("..","data","1_raw","tweets")
    dfm = []
    for f in os.listdir(filepath):
        dfm.append(pd.read_csv(os.path.join(filepath, f)))
    df = pd.concat(dfm)
    df = df.reset_index(drop=True)
    return df

In [5]:
df = load_all_data()
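The two loaders differ only in whether they filter by a date prefix, so they could be folded into one function. The sketch below is one possible way to do that; the name `load_data`, the `prefix` parameter, the `.csv` extension check, and the sorted file order are all assumptions and are not part of the original notebook.

```python
import os
import pandas as pd

def load_data(filepath, prefix=""):
    """Concatenate every CSV in `filepath` whose file name starts with `prefix`.

    An empty prefix loads everything; a prefix such as "20200906" loads one day.
    """
    # Sorting makes the row order reproducible (os.listdir order is arbitrary)
    files = sorted(
        f for f in os.listdir(filepath)
        if f.startswith(prefix) and f.endswith(".csv")  # extension is an assumption
    )
    frames = [pd.read_csv(os.path.join(filepath, f)) for f in files]
    # pd.concat raises ValueError if no file matched
    return pd.concat(frames, ignore_index=True)
```

With this helper, `load_data(path)` would play the role of `load_all_data()` and `load_data(path, prefix=today_prefix)` the role of `load_todays_data()`.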

In [6]:
df.shape

(975400, 5)

In [7]:
# test for duplicated IDs
ids = df["ID"]
df[ids.isin(ids[ids.duplicated()])].shape

(0, 5)

In [10]:
# test for duplicated Text
txt = df["Text"]
df[txt.isin(txt[txt.duplicated()])].shape

(293705, 5)
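The count above includes every row whose text occurs more than once, first occurrences included, so it is larger than the number of rows that deduplication removes later. A small sketch on made-up data shows the difference; `toy` and its contents are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Text": ["apple", "orange", "apple", "banana", "apple"]})
txt = toy["Text"]

# Idiom used in the cell above: every row whose text occurs more than once
via_isin = toy[txt.isin(txt[txt.duplicated()])]

# Equivalent one-step form: keep=False marks all members of a duplicate group
via_keep_false = toy[txt.duplicated(keep=False)]

# Only the repeats (first occurrence excluded), i.e. the rows dedup would drop
repeats_only = toy[txt.duplicated(keep="first")]

print(len(via_isin), len(via_keep_false), len(repeats_only))  # 3 3 2
```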

In [11]:
# look at users with more than 1 tweet?
grouped = df[['User', 'ID']].groupby('User').count().sort_values('ID', ascending=False)

grouped[grouped['ID']>1].head()

User,ID
pinkyfaye,57
GetVidBot,52
KenanWaters,43
vmindarling,37
DeadPoolzNutz,35


## Add in Feature Engineering step

### Retweet Column

Issue: not present in sentiment140 data

In [12]:
def is_retweet(text):
    # 1 if the tweet text starts with "RT", otherwise 0
    return 1 if re.match(r'^RT', text) is not None else 0

def map_is_retweet(col):
    # apply is_retweet to every tweet text and return a list of 0/1 flags
    return [is_retweet(x) for x in col]

In [13]:
df['Retweet'] = map_is_retweet(df['Text'].values)
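The same flag can also be computed with pandas string methods rather than a Python-level map. The sketch below assumes the `Text` column has no missing values; the stricter `RT @handle` pattern is a hypothetical variant, not the rule used in this notebook.

```python
import pandas as pd

tweets = pd.DataFrame({
    "Text": ["RT @user: hello", "@user thank you", "RTs are fun", "hello RT"],
})

# Same rule as is_retweet: the text begins with the letters "RT"
tweets["Retweet"] = tweets["Text"].str.startswith("RT").astype(int)

# Stricter, hypothetical variant: "RT @handle" at the start of the text.
# Series.str.match anchors at the beginning of each string, like re.match.
tweets["Retweet_strict"] = tweets["Text"].str.match(r"RT @\w+").astype(int)

print(tweets["Retweet"].tolist())         # [1, 0, 1, 0]
print(tweets["Retweet_strict"].tolist())  # [1, 0, 0, 0]
```

The third row shows why the looser rule can over-count: any tweet that merely starts with the letters "RT" is flagged.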

In [14]:
df.head()

Unnamed: 0,ID,Timestamp,User,Text,Polarity,Retweet
0,1302406168791470081,2020-09-06 00:40:02,asarinanamis,RT @Ayshiun: Totally inspired by@/kianamaiart'...,-1,1
1,1302406168766078976,2020-09-06 00:40:02,xleave_thecity,@thiinkinaboutit thank you 🥺,-1,0
2,1302406168170696706,2020-09-06 00:40:02,MsTam_Tam,some of yall retweets really be having me look...,-1,0
3,1302406168120365056,2020-09-06 00:40:02,_thebdawkk,@jeenbeen__ Thank you Jen 🥺 just miss the old ...,-1,0
4,1302406167956602880,2020-09-06 00:40:02,4ranghae1015,"@SJofficial My favorite part is the BAD boy, g...",-1,0


In [15]:
df.tail()

Unnamed: 0,ID,Timestamp,User,Text,Polarity,Retweet
975395,1312909692356575232,2020-10-05 00:17:17,min9yufluffy,RT @wonubliss: jeonghan's gift to seungcheol:\...,1,1
975396,1312909692285472768,2020-10-05 00:17:17,shawny_strolls,@acciopage394 So like when I’m at work and try...,1,0
975397,1312909692268687360,2020-10-05 00:17:17,melanin_sugar,@yadastarot Maria 🥰 thanx,1,0
975398,1312909692268548097,2020-10-05 00:17:17,atomicoffin,@pamelarenfree i’ve been having crazy back pai...,1,0
975399,1312909692226555904,2020-10-05 00:17:17,Dory24960234,Bless morning♥️thankyou Lord 🥰♥️,1,0


## Deduplicating on Text POC

In [28]:
data = {'Text': ['apple','orange','banana','apple','banana','banana','mango']
       , 'ID': [234309, 349102, 930443, 898344, 229945, 690346, 893427]
       , 'Retweet': [0, 0, 0, 1, 1, 0, 1]}
test = pd.DataFrame(data, columns = ['Text','ID','Retweet'])
test

Unnamed: 0,Text,ID,Retweet
0,apple,234309,0
1,orange,349102,0
2,banana,930443,0
3,apple,898344,1
4,banana,229945,1
5,banana,690346,0
6,mango,893427,1


In [29]:
dupes = test[test['Text'].duplicated(keep='first')]
dupes

Unnamed: 0,Text,ID,Retweet
3,apple,898344,1
4,banana,229945,1
5,banana,690346,0


In [30]:
dupes.shape[0]/test.shape[0] # % dupes

0.42857142857142855

In [31]:
dupes[dupes['Retweet']==1].shape[0]/dupes.shape[0] # % retweets in dupes

0.6666666666666666

In [36]:
dupes['ID'] # dup ids, remove from test

3    898344
4    229945
5    690346
Name: ID, dtype: int64

In [39]:
deduped_test = test[~test.ID.isin(dupes['ID'])]
deduped_test

Unnamed: 0,Text,ID,Retweet
0,apple,234309,0
1,orange,349102,0
2,banana,930443,0
6,mango,893427,1
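For comparison, the same result can be reached in one step with `DataFrame.drop_duplicates`. This is an alternative sketch on the same toy data, not the approach used in the cells above.

```python
import pandas as pd

test = pd.DataFrame({
    "Text": ["apple", "orange", "banana", "apple", "banana", "banana", "mango"],
    "ID": [234309, 349102, 930443, 898344, 229945, 690346, 893427],
    "Retweet": [0, 0, 0, 1, 1, 0, 1],
})

# Approach from the cells above: collect the duplicate IDs, then filter them out
dupes = test[test["Text"].duplicated(keep="first")]
deduped_by_id = test[~test["ID"].isin(dupes["ID"])]

# One-step alternative: keep the first row for each distinct Text
deduped_direct = test.drop_duplicates(subset="Text", keep="first")

print(deduped_direct.index.tolist())  # [0, 1, 2, 6]
```

Both forms keep whichever row appears first for a given text, regardless of its `Retweet` flag.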


In [41]:
deduped_test[deduped_test['Retweet']==1].shape[0]/deduped_test.shape[0] # % retweets in deduped data 

0.25

Then check for polarity and class balance.

### Deduping (real thing)

In [42]:
dupes = df[df['Text'].duplicated(keep='first')]

In [44]:
dupes.shape[0]/df.shape[0] # % dupes

0.2527803977855239

In [45]:
dupes[dupes['Retweet']==1].shape[0]/dupes.shape[0] # % retweets in dupes

0.9980978415165354

In [46]:
deduped_df = df[~df.ID.isin(dupes['ID'])]

In [47]:
deduped_df.shape

(728838, 6)

In [48]:
deduped_df[deduped_df['Retweet']==1].shape[0]/deduped_df.shape[0] # % retweets in deduped data 

0.3049882141161685

In [49]:
df[['ID','Polarity']].groupby('Polarity').count() # original polarity class balance

Polarity,ID
-1,487700
1,487700


In [50]:
deduped_df[['ID','Polarity']].groupby('Polarity').count() # polarity class balance after deduping

Polarity,ID
-1,343200
1,385638
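The raw counts are easier to compare as proportions. The sketch below recomputes the class shares from the counts shown in the two outputs above; the names `before`, `after`, and `balance` are illustrative.

```python
import pandas as pd

# Counts copied from the two outputs above, indexed by Polarity
before = pd.Series({-1: 487700, 1: 487700})
after = pd.Series({-1: 343200, 1: 385638})

# Share of each polarity class before and after deduplication
balance = pd.DataFrame({
    "before": before / before.sum(),
    "after": after / after.sum(),
})
print(balance.round(3))
```

Deduplication shifts the split from exactly 50/50 to roughly 47% negative and 53% positive. On the real frame, `deduped_df['Polarity'].value_counts(normalize=True)` would give the same shares directly.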


In [53]:
deduped_df = deduped_df.reset_index(drop=True)

In [54]:
deduped_df.head()

Unnamed: 0,ID,Timestamp,User,Text,Polarity,Retweet
0,1302406168791470081,2020-09-06 00:40:02,asarinanamis,RT @Ayshiun: Totally inspired by@/kianamaiart'...,-1,1
1,1302406168766078976,2020-09-06 00:40:02,xleave_thecity,@thiinkinaboutit thank you 🥺,-1,0
2,1302406168170696706,2020-09-06 00:40:02,MsTam_Tam,some of yall retweets really be having me look...,-1,0
3,1302406168120365056,2020-09-06 00:40:02,_thebdawkk,@jeenbeen__ Thank you Jen 🥺 just miss the old ...,-1,0
4,1302406167956602880,2020-09-06 00:40:02,4ranghae1015,"@SJofficial My favorite part is the BAD boy, g...",-1,0


In [55]:
deduped_df.tail()

Unnamed: 0,ID,Timestamp,User,Text,Polarity,Retweet
728833,1312909692356575232,2020-10-05 00:17:17,min9yufluffy,RT @wonubliss: jeonghan's gift to seungcheol:\...,1,1
728834,1312909692285472768,2020-10-05 00:17:17,shawny_strolls,@acciopage394 So like when I’m at work and try...,1,0
728835,1312909692268687360,2020-10-05 00:17:17,melanin_sugar,@yadastarot Maria 🥰 thanx,1,0
728836,1312909692268548097,2020-10-05 00:17:17,atomicoffin,@pamelarenfree i’ve been having crazy back pai...,1,0
728837,1312909692226555904,2020-10-05 00:17:17,Dory24960234,Bless morning♥️thankyou Lord 🥰♥️,1,0

