# Cleanup Process

There are two main schools of thought when it comes to cleaning data for machine learning projects:

1. Clean the entire dataset, then split it into train/test sets and proceed with preprocessing for ML.
2. Split the raw data into train/test sets, then set up a reusable cleaning pipeline.

The first method is easier for smaller projects that are not expected to scale or to receive future data.

The second is the standard process for an online solution: it scales to production deployment with new data.

Because of the large number of duplicated Tweets (mostly retweets), and because I'm not implementing an online solution, I prefer a hybrid approach: remove duplicates before splitting into train/test sets, but save any further cleanup for the analytic pipeline.
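As a rough illustration of the hybrid approach, the sketch below deduplicates a toy DataFrame on `Text` and only then splits it. The toy data, the 80/20 split and the names `raw`, `train_df` and `test_df` are assumptions made for this example; they are not taken from `dedupe.py`.

```python
import pandas as pd

# Toy stand-in for the raw Tweets (hypothetical data, not from the project)
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Text": ["RT @a: hi", "hello", "RT @a: hi", "bye", "hello", "new"],
})

# Step 1: drop repeated Text before splitting, keeping the first occurrence
deduped = raw.drop_duplicates(subset="Text", keep="first").reset_index(drop=True)

# Step 2: split into train/test sets (80/20 is an arbitrary choice here)
train_df = deduped.sample(frac=0.8, random_state=0)
test_df = deduped.drop(train_df.index)

# Step 3: any further cleanup would live in the pipeline, applied after the split
print(len(deduped), len(train_df), len(test_df))
```

Deduplicating first means the same Tweet text cannot land in both the train and test sets.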

This notebook is *not* the implementation of this process; it is an exploration of the data so that the process can be implemented. See `dedupe.py` for the actual deduplication script, which is run from the command line: `python dedupe.py`.

Here I am prematurely creating the variable "Retweet", which is essentially feature engineering (a step further down the pipeline), so that I can compare the percentage of retweets in duplicated and deduplicated Tweets, gain insight into the amount of duplication, and tune the Tweet ingestion schedule. Ingesting too many Tweets on a single day isn't worth the small gain in non-duplicated Tweets. I've tuned ingestion to 25% duplication at 50k Tweets a day, so I'm ingesting ~37,500 unique Tweets daily.
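The ingestion figures quoted above can be checked with a few lines of arithmetic. The variable names are only illustrative.

```python
# Back-of-the-envelope check of the ingestion figures quoted above
tweets_per_day = 50_000      # daily ingestion volume
duplication_rate = 0.25      # share of ingested Tweets that are duplicates

unique_per_day = tweets_per_day * (1 - duplication_rate)
print(f"~{unique_per_day:,.0f} unique Tweets per day")
```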

In [1]:
import os
import re
import time
import pandas as pd

In [9]:
csv_files = os.listdir(os.path.join("..","data","1_raw","tweets"))

print(f'There are {len(csv_files)} csv files.')

There are 1524 csv files.


### study dates and plot ingestion as a line graph...

In [10]:
def load_all_data():
    filepath = os.path.join("..","data","1_raw","tweets")
    dfm = []
    for f in os.listdir(filepath):
        dfm.append(pd.read_csv(os.path.join(filepath, f)))
    df = pd.concat(dfm)
    df = df.reset_index(drop=True)
    return df

In [11]:
df = load_all_data()
df.shape

(1415400, 5)

In [12]:
# quick glance at top users with multiple tweets
grouped = df[['User', 'ID']].groupby('User').count().sort_values('ID', ascending=False)
grouped[grouped['ID']>1].head()

User,ID
GetVidBot,82
pinkyfaye,70
vmindarling,58
sportsthread,57
KenanWaters,45


### Retweet Column

Issue: this column is not present in the sentiment140 data, so do not use it when comparing classifiers between projects or tracking data drift.

Note: this is a feature engineering step that is present here only for early analysis, not as a cleanup step.

In [13]:
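def is_retweet(text):
    # flag a single Tweet: 1 if its text starts with "RT", else 0
    # (the original loop returned on its first pass, and returned None for empty text)
    return 1 if re.match(r'^RT', text) is not None else 0

def map_is_retweet(col):
    # apply is_retweet to every Tweet text in the column
    return [is_retweet(x) for x in col]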

In [14]:
df['Retweet'] = map_is_retweet(df['Text'].values)
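
The same flag can probably be computed without a Python-level loop by using pandas string methods. The sketch below runs on a small made-up DataFrame (`tweets` is a hypothetical name) and, like the `^RT` pattern, flags any text that merely starts with the letters "RT".

```python
import pandas as pd

# Made-up sample Tweets, including a missing value
tweets = pd.DataFrame(
    {"Text": ["RT @user: hello", "just a tweet", "RTs are fun", None]}
)

# str.startswith is vectorized; na=False treats missing text as "not a retweet"
tweets["Retweet"] = tweets["Text"].str.startswith("RT", na=False).astype(int)
print(tweets)
```

A stricter rule such as matching `RT @` would avoid flagging ordinary Tweets like the third row, but that would change the definition used in this notebook.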

In [15]:
df.head()

Unnamed: 0,ID,Timestamp,User,Text,Polarity,Retweet
0,1302406168791470081,2020-09-06 00:40:02,asarinanamis,RT @Ayshiun: Totally inspired by@/kianamaiart'...,-1,1
1,1302406168766078976,2020-09-06 00:40:02,xleave_thecity,@thiinkinaboutit thank you 🥺,-1,0
2,1302406168170696706,2020-09-06 00:40:02,MsTam_Tam,some of yall retweets really be having me look...,-1,0
3,1302406168120365056,2020-09-06 00:40:02,_thebdawkk,@jeenbeen__ Thank you Jen 🥺 just miss the old ...,-1,0
4,1302406167956602880,2020-09-06 00:40:02,4ranghae1015,"@SJofficial My favorite part is the BAD boy, g...",-1,0


In [16]:
df.tail()

Unnamed: 0,ID,Timestamp,User,Text,Polarity,Retweet
1415395,1316053552976986112,2020-10-13 16:29:52,SayJoseReal,RT @amorhestia: toga it is 🥰✨ https://t.co/Sil...,1,1
1415396,1316053552956157952,2020-10-13 16:29:52,otleyshev68,@NYKChannel @elonmusk hey Elon don’t suppose p...,1,0
1415397,1316053552951816192,2020-10-13 16:29:52,Yenettirb,@camisavisionary Aye. I just screamed 😂,1,0
1415398,1316053552926687237,2020-10-13 16:29:52,JeremiahLambert,@German_Titans 😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂 \n\nGi...,1,0
1415399,1316053552897368064,2020-10-13 16:29:52,elizabethvarao,@IamSisSisisMe @carlazuill Can't wait to follo...,1,0


## POC for Deduplicating on Text

In [17]:
data = {'Text': ['apple','orange','banana','apple','banana','banana','mango']
       , 'ID': [234309, 349102, 930443, 898344, 229945, 690346, 893427]
       , 'Retweet': [0, 0, 0, 1, 1, 0, 1]}
test = pd.DataFrame(data, columns = ['Text','ID','Retweet'])
test

Unnamed: 0,Text,ID,Retweet
0,apple,234309,0
1,orange,349102,0
2,banana,930443,0
3,apple,898344,1
4,banana,229945,1
5,banana,690346,0
6,mango,893427,1


In [18]:
dupes = test[test['Text'].duplicated(keep='first')]
dupes

Unnamed: 0,Text,ID,Retweet
3,apple,898344,1
4,banana,229945,1
5,banana,690346,0


In [19]:
dupes.shape[0]/test.shape[0] # % dupes

0.42857142857142855

In [20]:
dupes[dupes['Retweet']==1].shape[0]/dupes.shape[0] # % retweets in dupes

0.6666666666666666

In [21]:
dupes['ID'] # dup ids, remove from test

3    898344
4    229945
5    690346
Name: ID, dtype: int64

In [22]:
deduped_test = test[~test.ID.isin(dupes['ID'])]
deduped_test

Unnamed: 0,Text,ID,Retweet
0,apple,234309,0
1,orange,349102,0
2,banana,930443,0
6,mango,893427,1


In [23]:
deduped_test[deduped_test['Retweet']==1].shape[0]/deduped_test.shape[0] # % retweets in deduped data 

0.25
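
As a cross-check on the POC, `DataFrame.drop_duplicates` gives the same deduplicated rows in one call. This is a sketch on the same toy data; `deduped_alt` is a name introduced only for this example. The two approaches agree here because every `ID` is unique. With repeated IDs, the `isin` filter could also drop first occurrences.

```python
import pandas as pd

test = pd.DataFrame({
    "Text": ["apple", "orange", "banana", "apple", "banana", "banana", "mango"],
    "ID": [234309, 349102, 930443, 898344, 229945, 690346, 893427],
    "Retweet": [0, 0, 0, 1, 1, 0, 1],
})

# Keep the first occurrence of each Text value and drop later repeats
deduped_alt = test.drop_duplicates(subset="Text", keep="first")

# Retweet is a 0/1 flag, so its mean is the share of retweets
retweet_share = deduped_alt["Retweet"].mean()
print(deduped_alt.index.tolist(), retweet_share)
```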

Then check for polarity and class balance.

## Deduplicating Data

In [24]:
dupes = df[df['Text'].duplicated(keep='first')]

In [25]:
print("% dupes: " + str(100*round(dupes.shape[0]/df.shape[0], 4)))

% dupes: 26.6


In [26]:
print("% retweets in dupes: " + str(100*round(dupes[dupes['Retweet']==1].shape[0]/dupes.shape[0], 4))) 

% retweets in dupes: 99.79


In [27]:
deduped_df = df[~df.ID.isin(dupes['ID'])]

In [28]:
deduped_df.shape

(1038850, 6)

In [29]:
print("% retweets in deduped data: " + \
      str(100*round(deduped_df[deduped_df['Retweet']==1].shape[0]/deduped_df.shape[0], 4)))

% retweets in deduped data: 30.34


In [30]:
df[['ID','Polarity']].groupby('Polarity').count() # original polarity class balance

Polarity,ID
-1,707700
1,707700


In [33]:
# polarity class balance after deduping
polarity_df = deduped_df[['ID','Polarity']].groupby('Polarity').count()
polarity_df

Polarity,ID
-1,488341
1,550509


In [34]:
pct_diff = 100*round(abs(polarity_df['ID'][1] - polarity_df['ID'][-1]) / sum(polarity_df['ID']), 4)
print("% diff in target (Polarity): " + str(round(pct_diff, 4)))

% diff in target (Polarity): 5.98
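
The imbalance figure can be reproduced from the class counts alone. The helper below is a hypothetical convenience function, not part of the project code. For a binary target it computes the same quantity as the cell above: the absolute difference between class counts divided by the total.

```python
import pandas as pd

def class_imbalance_pct(labels: pd.Series) -> float:
    """Gap between the largest and smallest class, as a % of all rows."""
    counts = labels.value_counts()
    return 100 * (counts.max() - counts.min()) / counts.sum()

# Rebuild a label column from the deduplicated class counts reported above
labels = pd.Series([-1] * 488341 + [1] * 550509)
print(round(class_imbalance_pct(labels), 2))
```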


In [35]:
deduped_df.index = range(len(deduped_df.index))

In [36]:
deduped_df.head()

Unnamed: 0,ID,Timestamp,User,Text,Polarity,Retweet
0,1302406168791470081,2020-09-06 00:40:02,asarinanamis,RT @Ayshiun: Totally inspired by@/kianamaiart'...,-1,1
1,1302406168766078976,2020-09-06 00:40:02,xleave_thecity,@thiinkinaboutit thank you 🥺,-1,0
2,1302406168170696706,2020-09-06 00:40:02,MsTam_Tam,some of yall retweets really be having me look...,-1,0
3,1302406168120365056,2020-09-06 00:40:02,_thebdawkk,@jeenbeen__ Thank you Jen 🥺 just miss the old ...,-1,0
4,1302406167956602880,2020-09-06 00:40:02,4ranghae1015,"@SJofficial My favorite part is the BAD boy, g...",-1,0


In [37]:
deduped_df.tail()

Unnamed: 0,ID,Timestamp,User,Text,Polarity,Retweet
1038845,1316053552976986112,2020-10-13 16:29:52,SayJoseReal,RT @amorhestia: toga it is 🥰✨ https://t.co/Sil...,1,1
1038846,1316053552956157952,2020-10-13 16:29:52,otleyshev68,@NYKChannel @elonmusk hey Elon don’t suppose p...,1,0
1038847,1316053552951816192,2020-10-13 16:29:52,Yenettirb,@camisavisionary Aye. I just screamed 😂,1,0
1038848,1316053552926687237,2020-10-13 16:29:52,JeremiahLambert,@German_Titans 😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂 \n\nGi...,1,0
1038849,1316053552897368064,2020-10-13 16:29:52,elizabethvarao,@IamSisSisisMe @carlazuill Can't wait to follo...,1,0


---