# Data Preprocessor

### Extracts metadata from the `Social Honeypot ICWSM 2011` dataset for use as features in model training.
### The data is preprocessed and saved separately so that it loads faster during the training phase.
#### `Python 3.9.9`

In [1]:
import datetime
import pandas as pd

Load `content_polluters_tweets.txt`, remove empty tweets, and save as dataframe.

In [None]:
content_polluters_tweets_df = pd.read_csv('../Data/Dataset/social_honeypot_icwsm_2011/content_polluters_tweets.txt', sep="\t", header=None)
content_polluters_tweets_df.columns = ["UserID", "TweetID", "Tweet", "CreatedAt"]
content_polluters_tweets_df['Spam'] = True
content_polluters_tweets_df = content_polluters_tweets_df[content_polluters_tweets_df['Tweet'].notna()]
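
The load-label-filter pattern above is repeated for both tweet files, so it can be tried on a small in-memory sample first. The following is a minimal sketch, not part of the original notebook: the helper name `load_tweets` and the two sample rows are made up for illustration, and only the column layout is taken from the cell above.

```python
import io
import pandas as pd

# Two made-up rows in the four-column, tab-separated layout used above;
# the second row has an empty Tweet field.
sample = (
    "1\t100\thello world\t2010-01-02 03:04:05\n"
    "2\t200\t\t2010-01-03 00:00:00\n"
)

def load_tweets(source, spam):
    """Hypothetical helper: read a headerless TSV of tweets, label it, drop empty tweets."""
    df = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["UserID", "TweetID", "Tweet", "CreatedAt"],
    )
    df["Spam"] = spam
    # Empty fields are parsed as NaN, so dropna removes tweets without text.
    return df.dropna(subset=["Tweet"]).reset_index(drop=True)

sample_df = load_tweets(io.StringIO(sample), spam=True)
print(sample_df)
```

Passing `names=` to `read_csv` sets the column names at load time, which avoids the separate `.columns = [...]` assignment.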

Load `content_polluters.txt`, remove unneeded data, and save as dataframe.

In [None]:
content_polluters_df = pd.read_csv('../Data/Dataset/social_honeypot_icwsm_2011/content_polluters.txt', sep="\t", header=None)
content_polluters_df.columns = ["UserID", "UserCreatedAt", "CollectedAt", "NumberOfFollowings", "NumberOfFollowers", "NumberOfTweets", "LengthOfScreenName", "LengthOfDescriptionInUserProfile"]
content_polluters_df = content_polluters_df.drop(['CollectedAt'], axis=1)

Merge the user data into the collected tweets, matching on `UserID`.

In [None]:
content_polluters_tweets_df = pd.merge(content_polluters_tweets_df, content_polluters_df, on=['UserID'])
content_polluters_tweets_df['TimeDelta(Days)'] = None
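
`pd.merge` performs an inner join by default, so tweets whose `UserID` has no row in the user table are dropped silently. The sketch below uses made-up toy frames (not the dataset) to show one way to check for that: `validate="many_to_one"` asserts that each user appears only once in the user table, and a left join with `indicator=True` lists the tweets that found no match.

```python
import pandas as pd

# Toy data for illustration only: UserID 9 has a tweet but no user record.
tweets = pd.DataFrame({"UserID": [1, 1, 2, 9], "TweetID": [10, 11, 12, 13]})
users = pd.DataFrame({"UserID": [1, 2], "NumberOfFollowers": [5, 7]})

# Default inner join; raises MergeError if a UserID is duplicated in `users`.
inner = pd.merge(tweets, users, on="UserID", validate="many_to_one")

# Left join keeps every tweet and records where each row came from.
left = pd.merge(tweets, users, on="UserID", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(tweets), len(inner))        # rows before and after the inner join
print(unmatched["UserID"].tolist())   # users with tweets but no profile row
```

Whether dropping unmatched tweets is acceptable depends on the training setup; the check only makes the loss visible.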

Load `legitimate_users_tweets.txt`, remove empty tweets, and save as dataframe.

In [2]:
legitimate_users_tweets_df = pd.read_csv('../Data/Dataset/social_honeypot_icwsm_2011/legitimate_users_tweets.txt', sep="\t", header=None)
legitimate_users_tweets_df.columns = ["UserID", "TweetID", "Tweet", "CreatedAt"]
legitimate_users_tweets_df['Spam'] = False
legitimate_users_tweets_df = legitimate_users_tweets_df[legitimate_users_tweets_df['Tweet'].notna()]

Load `legitimate_users.txt`, remove unneeded data, and save as dataframe.

In [None]:
legitimate_users_df = pd.read_csv('../Data/Dataset/social_honeypot_icwsm_2011/legitimate_users.txt', sep="\t", header=None)
legitimate_users_df.columns = ["UserID", "UserCreatedAt", "CollectedAt", "NumberOfFollowings", "NumberOfFollowers", "NumberOfTweets", "LengthOfScreenName", "LengthOfDescriptionInUserProfile"]
legitimate_users_df = legitimate_users_df.drop(['CollectedAt'], axis=1)

Merge the user data into the collected tweets, matching on `UserID`.

In [None]:
legitimate_users_tweets_df = pd.merge(legitimate_users_tweets_df, legitimate_users_df, on=['UserID'])
legitimate_users_tweets_df['TimeDelta(Days)'] = None

Determine time difference (in days) between user account creation and time of posting for `content polluters`.

In [None]:
for i in content_polluters_tweets_df.index:
    tca = content_polluters_tweets_df.at[i, 'CreatedAt']
    uca = content_polluters_tweets_df.at[i, 'UserCreatedAt']
    # Store the delta for row i only; assigning to the whole column would overwrite every row.
    content_polluters_tweets_df.at[i, 'TimeDelta(Days)'] = (datetime.datetime.strptime(tca, '%Y-%m-%d %H:%M:%S') - datetime.datetime.strptime(uca, '%Y-%m-%d %H:%M:%S')).days

Determine time difference (in days) between user account creation and time of posting for `legitimate users`.

In [None]:
for i in legitimate_users_tweets_df.index:
    tca = legitimate_users_tweets_df.at[i, 'CreatedAt']
    uca = legitimate_users_tweets_df.at[i, 'UserCreatedAt']
    # Store the delta for row i only; assigning to the whole column would overwrite every row.
    legitimate_users_tweets_df.at[i, 'TimeDelta(Days)'] = (datetime.datetime.strptime(tca, '%Y-%m-%d %H:%M:%S') - datetime.datetime.strptime(uca, '%Y-%m-%d %H:%M:%S')).days
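
The row-by-row loops can be slow on large tweet tables. As an alternative, the same per-row day difference can be computed for a whole column at once with `pd.to_datetime`. This is a sketch under assumptions: the helper name `add_time_delta` and the two demo rows are illustrative, and the timestamp format is the one used in the loops above.

```python
import pandas as pd

FMT = "%Y-%m-%d %H:%M:%S"  # same timestamp format as in the loops above

def add_time_delta(df):
    """Hypothetical helper: return a copy with whole days from account creation to tweet."""
    out = df.copy()
    tweeted = pd.to_datetime(out["CreatedAt"], format=FMT)
    joined = pd.to_datetime(out["UserCreatedAt"], format=FMT)
    # .dt.days keeps only the whole-day component, like timedelta.days.
    out["TimeDelta(Days)"] = (tweeted - joined).dt.days
    return out

# Made-up rows: 10 days 12 hours apart, and 18 hours apart.
demo = pd.DataFrame({
    "CreatedAt": ["2010-01-11 12:00:00", "2010-01-02 00:00:00"],
    "UserCreatedAt": ["2010-01-01 00:00:00", "2010-01-01 06:00:00"],
})
demo = add_time_delta(demo)
print(demo["TimeDelta(Days)"].tolist())  # [10, 0]
```

The helper works on either dataframe, so one call per dataframe would replace both loops.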

Save dataframes to CSV files:
- Content Polluters' Preprocessed Tweets: `content_polluters_tweets_pp.csv`
- Legitimate Users' Preprocessed Tweets: `legitimate_users_tweets_pp.csv`

In [None]:
content_polluters_tweets_df.to_csv('../Data/content_polluters_tweets_pp.csv')

In [None]:
legitimate_users_tweets_df.to_csv('../Data/legitimate_users_tweets_pp.csv')
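
One detail to keep in mind when loading these files during training: `to_csv` writes the dataframe index by default, and reading the file back with `pd.read_csv` then produces an extra `Unnamed: 0` column. The sketch below demonstrates this with a toy frame and in-memory buffers instead of the real files; passing `index=False` when saving is one way to avoid the extra column.

```python
import io
import pandas as pd

# Toy frame for illustration only.
df = pd.DataFrame({"UserID": [1, 2], "Spam": [True, False]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))  # ['Unnamed: 0', 'UserID', 'Spam']
print(list(reloaded_clean.columns))    # ['UserID', 'Spam']
```

If the files are kept as written above, reading them with `index_col=0` is the equivalent fix on the loading side.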