# Preprocessing for NLP

Before the unstructured tweet text can be analyzed, it has to go through several preprocessing steps. We begin by importing the data and keeping the first 150 rows.

In [1]:
import pandas as pd
data = pd.read_csv('g23dataset.csv')
data = data.iloc[:150]
data.head()

Unnamed: 0,ID,Timestamp,Tweet URL,Group,Collector,Category,Topic,Keywords,Account handle,Account name,...,Location,Tweet,Tweet Translated,Tweet Type,Date posted,Screenshot,Content type,Likes,Replies,Retweets
0,23-1,13/03/23 13:25:32,https://twitter.com/GirlFromIhawan/status/1534...,23,"Lorico, Hans Daniel",MRCS,Misinformation on Food Situation during Ferdin...,"Kadiwa, agricultural, Masagana 99",@GirlFromIhawan,🔥WILD TONIGHT🔥,...,,"Yes mare, panahon ni marcos sr. Nagawa na nila...","Yes friend, during the time of Marcos Sr., the...","Text, Reply",08/06/2022 12:48,https://drive.google.com/file/d/1weUGYhQjHZ9jg...,Rational,1,0,0
1,23-2,13/03/23 13:45:32,https://twitter.com/kotomba431C/status/1306285...,23,"Lorico, Hans Daniel",MRCS,Misinformation on Food Situation during Ferdin...,"Nutribun, Panahon ni Marcos, mag-aaral",@kotomba431C,Kotomba Powder,...,,Ito Ang Nutribun at Gatas na ibinibigay sa Mga...,This is the Nutribun and milk that was provide...,"Text, Image",17/09/20 1:35,https://drive.google.com/file/d/1-MhBL8k7V3TBi...,Emotional,16,1,3
2,23-3,13/03/23 13:50:32,https://twitter.com/indaysara/status/159141949...,23,"Lorico, Hans Daniel",MRCS,Misinformation on Food Situation during Ferdin...,"Nutribun, Marcos, nutrition, gulay",@indaysara,Sara Duterte,...,Davao City,"Bilang pa-birthday, handog ni Sen. Imee ang mg...","For her birthday, Sen. Imee offers these nutri...","Text, Image, Reply",12/11/22 21:15,https://drive.google.com/file/d/1cs3B8smx65wTe...,Rational,191,12,29
3,23-4,31/03/23 10:35:49,https://twitter.com/aa_alegades/status/1108343...,23,"Rosales, Christian Jay",MRCS,Misinformation on Food Situation during Ferdin...,"Walang nagugutom, Marcos",@aa_alegades,Alegades,...,"Quezon City, National Capital","Kay Marcos, walang nagugutom na estudyante! Sa...","Because of Marcos, no student was hungry. Than...",Text,20/03/19 20:25,https://drive.google.com/file/d/1HaNIoJc7WmieU...,Emotional,0,1,0
4,23-5,31/03/23 14:41:53,https://twitter.com/Yelle007/status/1303897326...,23,"Rosales, Christian Jay",MRCS,Misinformation on Food Situation during Ferdin...,"Walang nagugutom, Marcos",@Yelle007,Burger Ranger 🍔✌️😊,...,,"eschuzme! @dawende, noong time ni Marcos walan...","eschuzme! @dawende, nobody went hungry during ...","Text, Reply (comment)",09/10/20 11:25,https://drive.google.com/file/d/1iEKP_N0N-6fzM...,Emotional,3,0,1


In [2]:
data.shape

(150, 25)

Tokenization splits the raw text into units that NLP libraries can process; the Natural Language Toolkit (NLTK) Python module is used for this. Before tokenizing, the text is converted to lowercase so that the same word is not treated as separate tokens because of capitalization, and punctuation and emoji are removed since they do not help identify the main keywords used in the tweets. After tokenizing, stopwords are removed using NLTK's English stopword list, which fits because the English translations of the tweets are used.

Finally, the tokens are lemmatized so that inflected forms of a word are not treated as distinct tokens. Lemmatization reduces a word to its dictionary form, known as its lemma; for example, "students" becomes "student". NLTK's WordNet lemmatizer treats every word as a noun unless a part-of-speech tag is supplied, so an irregular form such as "better" is mapped to "good" only when it is tagged as an adjective. All of these preprocessing steps are applied to the tweets in the dataframe.
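As a minimal sketch of the lowercasing and punctuation/emoji removal described above, the character filter can be tried on a single string without NLTK. The helper name `normalize` and the sample string are illustrative assumptions, not part of the notebook, and whitespace splitting only stands in for NLTK's tokenizer.

```python
def normalize(text):
    """Lowercase the text and keep only alphanumeric characters and spaces."""
    lowered = text.lower()
    return ''.join(c for c in lowered if c.isalnum() or c == ' ')

# Hypothetical sample, loosely modelled on the tweets in the dataset
sample = "Eschuzme! @dawende, nobody went hungry 🍔"
cleaned = normalize(sample)

# str.split() is only a stand-in for nltk.word_tokenize here
tokens = cleaned.split()
print(tokens)

# Apostrophes are dropped too, so a contraction is fused into one token
print(normalize("Don't"))
```

Because apostrophes are removed before stopword filtering, a fused form such as "dont" may no longer match a stopword list that stores the contraction with its apostrophe, so a few function words can survive the filter.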

In [3]:
# Import NLTK and make sure the required resources are downloaded
# (newer NLTK releases may additionally need nltk.download('punkt_tab'))
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hanslorico/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hanslorico/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hanslorico/nltk_data...


True

In [4]:
# Extract the translated tweets, dropping rows without a translation
tweets = data['Tweet Translated'].dropna()

# Convert all tweets to lowercase
tweets = tweets.str.lower()

# Strip punctuation and emoji, tokenize, remove stopwords, then lemmatize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

processed_text = []

for text in tweets:
    # Punctuation and emoji removal: keep only alphanumeric characters and spaces
    tok = ''.join([c for c in text if c.isalnum() or c == ' '])
    
    # Tokenization
    tok_list = nltk.word_tokenize(tok)
    
    # Stopword Removal
    tok_list = [word for word in tok_list if word not in stop_words]
    
    # Lemmatization
    tok_list = [lemmatizer.lemmatize(word) for word in tok_list]
    
    processed_text.append(tok_list)
    
# One row per tweet, with its token list in a single column
tweets = pd.Series(processed_text).to_frame()
tweets.head()

Unnamed: 0,0
0,"[yes, friend, time, marcos, sr, able, masagana..."
1,"[nutribun, milk, provided, student, time, ferd..."
2,"[birthday, sen, imee, offer, nutribuns, child,..."
3,"[marcos, student, hungry, thanks, nutribun, ma..."
4,"[eschuzme, dawende, nobody, went, hungry, marc..."
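
The token table above gets a fresh 0-based index, so if `dropna()` removed any rows, row i of the tokens no longer necessarily corresponds to row i of `data`. The following is a minimal sketch of one way to keep the two aligned. It assumes a made-up toy frame in place of the real dataset and uses whitespace splitting as a stand-in for the NLTK pipeline; the names `toy`, `token_lists`, `tokens`, and `aligned` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the dataset: the column name matches the notebook, the rows are made up
toy = pd.DataFrame({'Tweet Translated': ['No student was hungry', None, 'Milk for students']})

texts = toy['Tweet Translated'].dropna().str.lower()

# Whitespace splitting stands in for the full NLTK pipeline
token_lists = [text.split() for text in texts]

# Reuse the surviving row labels so each token list stays attached to its source row
tokens = pd.Series(token_lists, index=texts.index, name='tokens')

# Rows without a translation get NaN in the 'tokens' column after the join
aligned = toy.join(tokens)
print(aligned)
```

Keeping the original index makes it possible to relate tokens back to other columns, such as the content type or the engagement counts, in later analysis.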
