# 02. Cleaning 

After the CSV files for the desired Twitter users have been downloaded, this notebook combines them into a single dataframe, anonymizes the tweets by removing usernames, strips URLs, quotes, and the 'RT :' prefix, and adds a column indicating whether each tweet is a retweet or an original. It also saves the cleaned tweets as a .txt file for model fine-tuning later.

In [48]:
import os
import re
import warnings

import pandas as pd
import numpy as np

Do not upload the raw files from `01_twitter_user.py`: they contain usernames and mentions that could make the users easily identifiable. Load every file from the Twitter data folder below, then reset the index, check the shape to confirm that the expected number of tweets was downloaded, and run the following cells to remove usernames, URLs, and the 'RT :' text.

In [3]:
# pull all files from the folder and combine them into a single dataframe
# (DataFrame.append was removed in pandas 2.0, so collect the frames and use pd.concat)

data_dir = '../no_upload_twitter_data/'

frames = [
    pd.read_csv(os.path.join(data_dir, filename), index_col=0)
    for filename in sorted(os.listdir(data_dir))  # sorted for a reproducible row order
    if 'twitter_user_' in filename
]
df = pd.concat(frames)

In [4]:
df.reset_index(inplace=True, drop=True)

In [5]:
df.shape

(35999, 2)

In [6]:
df['Tweet'].shape[0]

35999

In [7]:
#create new column to overwrite with function below
df['new_tweet'] = df['Tweet']

In [8]:
#remove all usernames

no_usernames = []

for ind in range(0, df['Tweet'].shape[0]):
    # raw string so that \w reaches the regex engine unchanged
    no_usernames.append(re.sub(r'@\w+', '', df['Tweet'][ind]))



In [9]:
df['new_tweet'] = no_usernames

In [10]:
df[['new_tweet']].head(2)

Unnamed: 0,new_tweet
0,That would be sign of an inexperienced inves...
1,Although the market reset is healthy and good ...
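
The username loop above can also be written as a single vectorized call. The following is a minimal sketch on made-up tweets (the `sample` dataframe and its contents are hypothetical, not taken from the real data); it uses the same `@\w+` pattern as the notebook.

```python
import pandas as pd

# Hypothetical toy tweets standing in for the real data.
sample = pd.DataFrame({
    'Tweet': [
        '@alice thanks for the note',
        'RT @bob: markets reset',
        'no mentions here',
    ]
})

# Series.str.replace with regex=True applies the pattern to every row at once.
sample['new_tweet'] = sample['Tweet'].str.replace(r'@\w+', '', regex=True)

print(sample['new_tweet'].tolist())
```

Note that the retweet `RT @bob: markets reset` becomes `RT : markets reset`, which is why the later cells look for the literal text 'RT :'.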


In [11]:
# remove all URLs

no_urls = []

for ind in range(0, df['new_tweet'].shape[0]):
    no_urls.append(re.sub(r'http\S+', '', df['new_tweet'][ind]))

df['no_urls'] = no_urls

In [12]:
# save whether a tweet is a retweet or not
df['retweet'] = np.where(df.no_urls.str.contains("RT :"), 1, 0)

In [13]:
df[['no_urls','retweet']].head()

Unnamed: 0,no_urls,retweet
0,That would be sign of an inexperienced inves...,0
1,Although the market reset is healthy and good ...,0
2,I'd love to see what % of seed-stage investors...,0
3,"When thinking about valuations, anchoring on 2...",0
4,"Why there are still like 23,000 different form...",0
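
The retweet flag above marks any tweet that contains 'RT :' anywhere in its text. Whether a mid-tweet match should count as a retweet is a judgment call that this notebook does not address. As a sketch on made-up strings (the `no_urls` series here is hypothetical), the two behaviours can be compared like this:

```python
import numpy as np
import pandas as pd

# Hypothetical examples of tweets after username and URL removal.
no_urls = pd.Series([
    'RT : markets reset',        # retweet marker at the start
    'my take, RT : not needed',  # same text in the middle of a tweet
    'plain tweet',
])

# Same test as the notebook: literal 'RT :' anywhere in the tweet.
anywhere = np.where(no_urls.str.contains('RT :', regex=False), 1, 0)

# Stricter alternative: only count tweets that begin with 'RT :'.
at_start = np.where(no_urls.str.startswith('RT :'), 1, 0)

print(anywhere.tolist(), at_start.tolist())
```

If the two flags disagree on many rows of the real data, the share of retweets reported later would shift accordingly.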


In [14]:
# remove 'RT : ' from each tweet

no_rts = []

for ind in range(0, df['no_urls'].shape[0]):
    no_rts.append(re.sub(r'RT : ', '', df['no_urls'][ind]))

df['no_rts'] = no_rts

In [15]:
#save anonymized tweets with no URLs
#df[['no_rts','retweet']].to_csv('./data/3000_tweets.csv', index=0)

In [16]:
df.shape

(35999, 6)

In [17]:
df[['no_rts']].isnull().sum()

no_rts    0
dtype: int64

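In case there are any nulls, the cells below drop them. Note that `re.sub` returns an empty string rather than a null, so a tweet that consisted of nothing but a username and/or a URL ends up as an empty or whitespace-only string, which `isnull()` and `dropna()` do not catch.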

In [18]:
df.dropna(axis=0, inplace=True)

In [19]:
df.isnull().sum()

User         0
Tweet        0
new_tweet    0
no_urls      0
retweet      0
no_rts       0
dtype: int64
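
A quick sketch of how a username-only tweet behaves under these null checks, using made-up strings (the `cleaned` series is hypothetical). It also shows one possible way to filter out blank rows with a boolean mask, which is an addition and not something the notebook does.

```python
import re
import pandas as pd

# Hypothetical raw tweets; the first and last contain nothing but a username.
raw = ['@alice', 'hello @bob', '@carol  ']
cleaned = pd.Series([re.sub(r'@\w+', '', t) for t in raw])

# re.sub leaves empty or whitespace-only strings, which are not nulls.
print(cleaned.isnull().sum())

# Flag rows that are blank once surrounding whitespace is stripped, then drop them.
blank = cleaned.str.strip() == ''
cleaned_nonblank = cleaned[~blank]

print(cleaned_nonblank.tolist())
```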

Now we can save the tweets as a single clean text file for processing in the next notebook.

In [20]:
df['no_rts'] = df['no_rts'].str.lower()

In [21]:
text = df['no_rts']

In [22]:
# save text without quotes
#text_no = np.savetxt('./data/3000_tweets.txt', text.values, fmt = "%s")
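
As an alternative to `np.savetxt`, the text can be written with plain file I/O. This is a minimal sketch that assumes the fine-tuning step expects one tweet per line (the notebook does not state this), so line breaks inside a tweet are collapsed first. The `text` series and the temporary output path are hypothetical stand-ins.

```python
import os
import tempfile
import pandas as pd

# Hypothetical tweets; the second one contains an embedded line break.
text = pd.Series(['first tweet', 'second tweet\nwith a line break'])

# Collapse any run of whitespace (including newlines) into a single space.
one_line = text.str.replace(r'\s+', ' ', regex=True).str.strip()

# Hypothetical output location; the notebook writes under ./data/ instead.
out_path = os.path.join(tempfile.mkdtemp(), 'tweets.txt')
with open(out_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(one_line) + '\n')

# Read the file back to confirm there is exactly one tweet per line.
with open(out_path, encoding='utf-8') as f:
    lines = f.read().splitlines()

print(lines)
```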

In [23]:
df['retweet'].value_counts(normalize=True)

0    0.771466
1    0.228534
Name: retweet, dtype: float64

The sample of 35,999 tweets is split about 77%/23% between original tweets and retweets.

The data can now be further explored in the next notebook.