## Book 2: Twitter & Reddit Data Cleaning

*Note: I was unable to upload all of the datasets used in these notebooks to GitHub because the files were too large. If you have any questions, feel free to reach out to me. Thank you :)*

Book 2 focuses on the data cleaning of both Twitter and Reddit.

For cleaning the **Twitter** data:
> 1. Removing any emojis that may exist in the posts
> 2. Removing any URLs
> 3. Removing any mentions (e.g. @xxx)
> 4. Removing any hashtags (e.g. #world)
> 5. Dropping any null values
> 6. Dropping any duplicates
> 7. Dropping any unnecessary rows and columns

For cleaning the **Reddit** data:
> 1. Removing any URLs
> 2. Removing any mentions (e.g. @xxx)
> 3. Removing any hashtags (e.g. #world)
> 4. Dropping any null values
> 5. Dropping any unnecessary rows and columns

Twitter’s data was then exported as: ‘T_suicide.csv’

Reddit’s data was then exported as: ‘R_suicide.csv’
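
The cleaning steps listed above can be sketched on a toy DataFrame as follows. This is a minimal sketch rather than the notebook's exact code: `clean_text` and the toy rows are hypothetical, the regex patterns mirror the ones used in the cells below, and emoji removal is left out for brevity.

```python
import re
import pandas as pd

# Patterns mirror the ones used later in this notebook
URL_RE = re.compile(r"http\S+")
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")
HASHTAG_RE = re.compile(r"#[A-Za-z0-9_]+")

def clean_text(text: str) -> str:
    """Hypothetical helper: strip URLs, mentions and hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())

# Toy rows standing in for the real posts
toy = pd.DataFrame({
    "text": ["hello @bob see http://t.co/x #world", None,
             "hello @amy see #world", "plain text"],
    "class": ["a", "a", "b", "b"],
})

toy = toy.dropna(subset=["text"]).copy()      # drop nulls before any string handling
toy["text"] = toy["text"].apply(clean_text)   # remove URLs, mentions, hashtags
toy = toy.drop_duplicates(subset=["text"], keep="first").reset_index(drop=True)
```

Collapsing whitespace at the end is an extra step not present in the notebook; it makes rows that differ only in removed tokens compare as duplicates.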


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import string
import re
import nltk

from sklearn.decomposition import NMF, LatentDirichletAllocation
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud, ImageColorGenerator
from nltk.probability import FreqDist
from textblob import TextBlob
import preprocessor as p
from tqdm import tqdm


## Twitter's Data

In [2]:
# Importing
T_suicide = pd.read_csv('../data/tweets_merged_Final_Edited_02.csv')

In [3]:
# Removing emojis
def demoji(text):
    """Remove emojis and other pictographic symbols from a string."""
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters (broad: also spans CJK and other scripts)
                               u"\U00010000-\U0010ffff"  # all supplementary-plane characters
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [4]:
T_suicide[u'Text'] = T_suicide[u'Text'].astype(str)

In [5]:
T_suicide[u'Text'] = T_suicide[u'Text'].apply(lambda x:demoji(x))
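
One thing worth checking with this pattern is how much it removes beyond emojis. The range `\U000024C2-\U0001F251` spans many non-emoji code points, including CJK characters, so non-English text would be stripped as well. The sketch below compares the two widest ranges with a narrower pattern; `narrow` is an illustrative assumption, not an exhaustive emoji list.

```python
import re

# The two widest ranges from demoji above
broad = re.compile("[\U000024C2-\U0001F251\U00010000-\U0010ffff]+")

# Narrower illustrative pattern: common emoji blocks only (not exhaustive)
narrow = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]+"
)

sample = "日本語 text 😀"
broad_result = broad.sub("", sample)    # removes the emoji and the Japanese characters
narrow_result = narrow.sub("", sample)  # removes only the emoji
```

For an English-only corpus the broad ranges may be acceptable; the narrower pattern matters mainly when posts mix scripts.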

In [6]:
T_suicide.head()

Unnamed: 0,Datetime,Tweet Id,Text,Username,Class
0,2020-01-17 23:59:10+00:00,1.22e+18,job and i wouldn’t have to save money toward a...,KaiiiKay,suicide
1,2020-01-17 23:32:32+00:00,1.22e+18,I HATE ME SO MUCH I WANT TO KILL MYSELF NO ONE...,MGlmcm,suicide
2,2020-01-17 23:30:16+00:00,1.22e+18,Back on here the cry for help but I don’t want...,Coughin_Up_Love,suicide
3,2020-01-17 22:57:16+00:00,1.22e+18,"Ya ever had that feeling of ""I don't want to k...",real_red_rabbit,suicide
4,2020-01-17 22:11:03+00:00,1.22e+18,I really want to kill myself just to stop thin...,TALA6955,suicide


In [7]:
# Exporting
T_suicide.to_csv('../data/tweets_merged_Final_Edited_03.csv',index=False, encoding='utf-8')

In [8]:
# Reimport data
T_suicide = pd.read_csv('../data/tweets_merged_Final_Edited_03.csv')

In [9]:
type(T_suicide)

pandas.core.frame.DataFrame

In [10]:
T_suicide.shape

(2182, 5)

In [11]:
T_suicide.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2182 entries, 0 to 2181
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Datetime  2182 non-null   object 
 1   Tweet Id  2182 non-null   float64
 2   Text      2180 non-null   object 
 3   Username  2182 non-null   object 
 4   Class     2182 non-null   object 
dtypes: float64(1), object(4)
memory usage: 85.4+ KB


In [12]:
T_suicide.head()

Unnamed: 0,Datetime,Tweet Id,Text,Username,Class
0,2020-01-17 23:59:10+00:00,1.22e+18,job and i wouldn’t have to save money toward a...,KaiiiKay,suicide
1,2020-01-17 23:32:32+00:00,1.22e+18,I HATE ME SO MUCH I WANT TO KILL MYSELF NO ONE...,MGlmcm,suicide
2,2020-01-17 23:30:16+00:00,1.22e+18,Back on here the cry for help but I don’t want...,Coughin_Up_Love,suicide
3,2020-01-17 22:57:16+00:00,1.22e+18,"Ya ever had that feeling of ""I don't want to k...",real_red_rabbit,suicide
4,2020-01-17 22:11:03+00:00,1.22e+18,I really want to kill myself just to stop thin...,TALA6955,suicide


In [13]:
# Renaming columns to lowercase snake_case
T_suicide.rename({'Datetime': 'datetime', 
                  'Tweet Id': 'tweet_id', 'Text': 'text',
                  'Username': 'username', 'Class': 'class'}, axis=1, inplace=True)
T_suicide.head()

Unnamed: 0,datetime,tweet_id,text,username,class
0,2020-01-17 23:59:10+00:00,1.22e+18,job and i wouldn’t have to save money toward a...,KaiiiKay,suicide
1,2020-01-17 23:32:32+00:00,1.22e+18,I HATE ME SO MUCH I WANT TO KILL MYSELF NO ONE...,MGlmcm,suicide
2,2020-01-17 23:30:16+00:00,1.22e+18,Back on here the cry for help but I don’t want...,Coughin_Up_Love,suicide
3,2020-01-17 22:57:16+00:00,1.22e+18,"Ya ever had that feeling of ""I don't want to k...",real_red_rabbit,suicide
4,2020-01-17 22:11:03+00:00,1.22e+18,I really want to kill myself just to stop thin...,TALA6955,suicide


In [14]:
T_suicide['class'].value_counts()

suicide        1115
non-suicide    1067
Name: class, dtype: int64

In [15]:
T_suicide.isnull().sum()

datetime    0
tweet_id    0
text        2
username    0
class       0
dtype: int64

#### Removing 

In [16]:
T_suicide['text'] = T_suicide['text'].astype(str)

In [17]:
# Removing any url links 
def remove_URL(sample):
    """Remove URLs from a sample string"""
    return re.sub(r"http\S+", "", sample)

In [18]:
T_suicide['text'] = T_suicide['text'].apply(lambda x:remove_URL(x))
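
The pattern `http\S+` only matches links that start with `http`, so bare `www.` links are left in place. A small sketch of a variant that covers both forms; `remove_url_variant` is a hypothetical name and the wider pattern is an assumption, not part of the original notebook.

```python
import re

def remove_url_variant(sample: str) -> str:
    """Hypothetical variant: strip links starting with http(s):// or www."""
    return re.sub(r"(?:https?://|www\.)\S+", "", sample)

text = "see www.example.com and https://example.com/a now"

original_result = re.sub(r"http\S+", "", text)  # pattern used in the notebook
variant_result = remove_url_variant(text)
```

Both versions leave the surrounding spaces behind, so a whitespace clean-up pass may still be useful afterwards.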

In [19]:
# Removing any @mentions in the tweet
def remove_mentions(sample):
    """Remove @mentions from a sample string"""
    return re.sub(r"@[A-Za-z0-9_]+", "", sample)

In [20]:
T_suicide['text'] = T_suicide['text'].apply(lambda x:remove_mentions(x))

In [21]:
# Removing any #hashtags in the tweet
def remove_hashtags(sample):
    """Remove #hashtags from a sample string"""
    return re.sub(r"#[A-Za-z0-9_]+", "", sample)

In [22]:
T_suicide['text'] = T_suicide['text'].apply(lambda x:remove_hashtags(x))

In [23]:
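# Note: Series.replace matches whole cell values, so 'u200d' inside a longer string is left untouched
T_suicide['text'] = T_suicide['text'].replace('u200d', ' ')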
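
If the intent of this cell is to strip the zero-width joiner (`\u200d`) from inside each post, substring replacement is needed. The sketch below, on a toy Series, contrasts `Series.replace`, which matches whole cell values, with `Series.str.replace`, which works on substrings. Treating the target as the zero-width joiner is an assumption about the author's intent.

```python
import pandas as pd

s = pd.Series(["a\u200db", "u200d", "plain"])

# Series.replace compares whole cell values, so only the second cell changes
whole_value = s.replace("u200d", " ")

# .str.replace works on substrings; regex=False treats the pattern literally
substring = s.str.replace("\u200d", " ", regex=False)
```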

In [24]:
T_suicide.head()

Unnamed: 0,datetime,tweet_id,text,username,class
0,2020-01-17 23:59:10+00:00,1.22e+18,job and i wouldn’t have to save money toward a...,KaiiiKay,suicide
1,2020-01-17 23:32:32+00:00,1.22e+18,I HATE ME SO MUCH I WANT TO KILL MYSELF NO ONE...,MGlmcm,suicide
2,2020-01-17 23:30:16+00:00,1.22e+18,Back on here the cry for help but I don’t want...,Coughin_Up_Love,suicide
3,2020-01-17 22:57:16+00:00,1.22e+18,"Ya ever had that feeling of ""I don't want to k...",real_red_rabbit,suicide
4,2020-01-17 22:11:03+00:00,1.22e+18,I really want to kill myself just to stop thin...,TALA6955,suicide


#### Dropping

In [25]:
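# Dropping any NA rows
# Note: 'text' was cast to str in In [16], so the 2 missing values appear to be 'nan' strings by now and nothing is dropped
T_suicide.dropna(inplace = True)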

In [26]:
T_suicide.shape

(2182, 5)
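
The shape is unchanged at (2182, 5) even though `isnull()` reported two missing `text` values earlier, which suggests the `astype(str)` cast in In [16] turned them into strings before `dropna` ran. A minimal sketch of the alternative order, on toy data, drops missing rows first and casts afterwards:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"text": ["hello", np.nan, "world"],
                   "class": ["a", "a", "b"]})

# Drop rows with missing text first, while NaN is still recognised as missing
cleaned = df.dropna(subset=["text"]).copy()

# Cast afterwards, so no missing value can turn into a placeholder string
cleaned["text"] = cleaned["text"].astype(str)
```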

In [27]:
# Dropping any duplicates
T_suicide.drop_duplicates(subset = ['text'], keep = 'first', inplace = True)

# To check: 117 duplicate rows were removed (2182 -> 2065)
T_suicide.shape

(2065, 5)
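
The shape goes from (2182, 5) to (2065, 5), so 117 rows were dropped as duplicates. A small sketch, on toy data, of how the duplicate count could be reported directly before dropping:

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "a", "c", "b"]})

# duplicated() flags every repeat after the first occurrence
n_dupes = int(df.duplicated(subset=["text"], keep="first").sum())

deduped = df.drop_duplicates(subset=["text"], keep="first")
n_removed = len(df) - len(deduped)  # should equal n_dupes
```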

In [28]:
# To check
T_suicide['class'].value_counts()

non-suicide    1064
suicide        1001
Name: class, dtype: int64

In [29]:
# Dropping column
T_suicide.drop(['tweet_id'], axis = 1, inplace = True)

In [30]:
# Dropping rows (suicide)
T_suicide.drop(T_suicide.index[1000:1001], inplace = True)

In [31]:
T_suicide['class'].value_counts()

non-suicide    1064
suicide        1000
Name: class, dtype: int64

In [32]:
# Dropping rows (non-suicide)
T_suicide.drop(T_suicide.index[2000:2064], inplace = True)

In [33]:
T_suicide['class'].value_counts()

suicide        1000
non-suicide    1000
Name: class, dtype: int64
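
The two positional drops above rely on the rows being grouped by class, with the `suicide` rows first. A sketch of an order-independent alternative that keeps the first `n_per_class` rows of each class; the toy data and the name `n_per_class` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "text": [f"t{i}" for i in range(7)],
    "class": ["suicide"] * 4 + ["non-suicide"] * 3,
})

n_per_class = 2  # the notebook keeps 1000 per class

# groupby(...).head(n) keeps the first n rows of every group, in original row order
balanced = df.groupby("class", sort=False).head(n_per_class)
```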

In [34]:
# To check
T_suicide.head()

Unnamed: 0,datetime,text,username,class
0,2020-01-17 23:59:10+00:00,job and i wouldn’t have to save money toward a...,KaiiiKay,suicide
1,2020-01-17 23:32:32+00:00,I HATE ME SO MUCH I WANT TO KILL MYSELF NO ONE...,MGlmcm,suicide
2,2020-01-17 23:30:16+00:00,Back on here the cry for help but I don’t want...,Coughin_Up_Love,suicide
3,2020-01-17 22:57:16+00:00,"Ya ever had that feeling of ""I don't want to k...",real_red_rabbit,suicide
4,2020-01-17 22:11:03+00:00,I really want to kill myself just to stop thin...,TALA6955,suicide


In [35]:
# To check
T_suicide.isnull().sum()

datetime    0
text        0
username    0
class       0
dtype: int64

In [36]:
# Exporting
T_suicide.to_csv('../data/T_suicide.csv', index = False)

## Reddit's Data

In [37]:
# Using Reddit's data from Kaggle
R_suicide = pd.read_csv('../data/suicide_detection_reddit_raw.csv')

In [38]:
type(R_suicide)

pandas.core.frame.DataFrame

In [39]:
R_suicide.shape

(232074, 3)

In [40]:
R_suicide.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232074 entries, 0 to 232073
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  232074 non-null  int64 
 1   text        232074 non-null  object
 2   class       232074 non-null  object
dtypes: int64(1), object(2)
memory usage: 5.3+ MB


In [41]:
R_suicide.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,2,Ex Wife Threatening SuicideRecently I left my ...,suicide
1,3,Am I weird I don't get affected by compliments...,non-suicide
2,4,Finally 2020 is almost over... So I can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"I’m so lostHello, my name is Adam (16) and I’v...",suicide


In [42]:
R_suicide['class'].value_counts()

non-suicide    116037
suicide        116037
Name: class, dtype: int64

In [43]:
R_suicide.isnull().sum()

Unnamed: 0    0
text          0
class         0
dtype: int64

In [44]:
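# Caution: sorted() reorders only the labels, not the rows, so a label may no longer match its text
R_suicide['class'] = sorted(R_suicide['class'])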

# To check
R_suicide['class']

0         non-suicide
1         non-suicide
2         non-suicide
3         non-suicide
4         non-suicide
             ...     
232069        suicide
232070        suicide
232071        suicide
232072        suicide
232073        suicide
Name: class, Length: 232074, dtype: object
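
Assigning `sorted(...)` back to a single column reorders only the labels; the `text` column keeps its original order, so labels and posts may no longer line up. A sketch of sorting whole rows instead, on toy data, which keeps every label attached to its text:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["post A", "post B", "post C", "post D"],
    "class": ["suicide", "non-suicide", "suicide", "non-suicide"],
})
pairs_before = set(zip(df["text"], df["class"]))

# Sort whole rows by label; a stable sort preserves the original order within each class
sorted_df = df.sort_values("class", kind="stable").reset_index(drop=True)
pairs_after = set(zip(sorted_df["text"], sorted_df["class"]))
```

Comparing the (text, class) pairs before and after is a quick way to confirm that the sort did not change any labels.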

In [45]:
# Dropping any duplicates
R_suicide.drop_duplicates(subset=["text"], keep='last', inplace = True)

# No duplicates were found (To check)
R_suicide.shape

(232074, 3)

#### Removing 

In [46]:
R_suicide['text'] = R_suicide['text'].astype(str)

In [47]:
# Removing any url links 
def remove_URL(sample):
    """Remove URLs from a sample string"""
    return re.sub(r"http\S+", "", sample)

In [48]:
R_suicide['text'] = R_suicide['text'].apply(lambda x:remove_URL(x))

In [49]:
# Removing any @mentions in the post
def remove_mentions(sample):
    """Remove @mentions from a sample string"""
    return re.sub(r"@[A-Za-z0-9_]+", "", sample)

In [50]:
R_suicide['text'] = R_suicide['text'].apply(lambda x:remove_mentions(x))

In [51]:
# Removing any #hashtags in the post
def remove_hashtags(sample):
    """Remove #hashtags from a sample string"""
    return re.sub(r"#[A-Za-z0-9_]+", "", sample)

In [52]:
R_suicide['text'] = R_suicide['text'].apply(lambda x:remove_hashtags(x))

In [53]:
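# Note: Series.replace matches whole cell values, so 'u200d' inside a longer string is left untouched
R_suicide['text'] = R_suicide['text'].replace('u200d', ' ')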

In [54]:
R_suicide.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,2,Ex Wife Threatening SuicideRecently I left my ...,non-suicide
1,3,Am I weird I don't get affected by compliments...,non-suicide
2,4,Finally 2020 is almost over... So I can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,non-suicide
4,9,"I’m so lostHello, my name is Adam (16) and I’v...",non-suicide


#### Dropping

In [55]:
# Dropping column
R_suicide.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [56]:
# Checking for mid-point
R_suicide['class'].iloc[116036]

'non-suicide'

In [57]:
# Checking for mid-point
R_suicide['class'].iloc[116037]

'suicide'

In [58]:
# Dropping rows (non-suicide)
R_suicide.drop(R_suicide.head(115037).index,inplace=True) # drop first n rows

In [59]:
# Dropping rows (suicide)
# Drop the last 115037 rows of the dataframe, leaving 1000 suicide posts
R_suicide.drop(R_suicide.tail(115037).index, inplace=True) # drop last n rows

In [60]:
# To check that the values are the same 
R_suicide['class'].value_counts()

suicide        1000
non-suicide    1000
Name: class, dtype: int64
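
The head/tail drops above keep the 2000 rows on either side of the class boundary. If a random subset is preferred, one possible alternative is to sample a fixed number of rows per class. This is a sketch on toy data; `n_per_class` and the seed are assumptions, not values from the notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "text": [f"post {i}" for i in range(10)],
    "class": ["non-suicide"] * 5 + ["suicide"] * 5,
})

n_per_class = 2  # the notebook keeps 1000 per class

# Draw the same number of rows from every class; random_state makes the draw repeatable
balanced = df.groupby("class").sample(n=n_per_class, random_state=42)
```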

In [61]:
# Final check
R_suicide.isnull().sum()

text     0
class    0
dtype: int64

In [62]:
# Exporting
R_suicide.to_csv('../data/R_suicide.csv', index = False)