## Text Classification

In this notebook, we classify tweets that mention disasters according to whether they describe a real disaster or not (e.g. a fictional one). Two feature extraction methods are shown: tf-idf and word2vec.

In [1]:
import pandas as pd
import string
from nltk.corpus import stopwords
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from keras.layers import Dense
from keras.models import Sequential
from keras import optimizers

### Loading Data

We will start by loading the data and taking a look at the first few rows. We will then drop the unnecessary columns.

In [2]:
df = pd.read_csv('tweets.csv')

In [3]:
df.head()

Unnamed: 0,id,keyword,location,text,target
0,0,ablaze,,"Communal violence in Bhainsa, Telangana. ""Ston...",1
1,1,ablaze,,Telangana: Section 144 has been imposed in Bha...,1
2,2,ablaze,New York City,Arsonist sets cars ablaze at dealership https:...,1
3,3,ablaze,"Morgantown, WV",Arsonist sets cars ablaze at dealership https:...,1
4,4,ablaze,,"""Lord Jesus, your love brings freedom and pard...",0


In [4]:
# Drop unnecessary columns
df = df.drop(columns = ["id","keyword","location"])

In [5]:
df.head()

Unnamed: 0,text,target
0,"Communal violence in Bhainsa, Telangana. ""Ston...",1
1,Telangana: Section 144 has been imposed in Bha...,1
2,Arsonist sets cars ablaze at dealership https:...,1
3,Arsonist sets cars ablaze at dealership https:...,1
4,"""Lord Jesus, your love brings freedom and pard...",0


### Preprocessing

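Before extracting features, we normalize the text: lowercase it, strip punctuation, and drop non-alphabetic tokens, stop words, and single-character tokens. This reduces noise in the vocabulary the model learns from.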

In [22]:
def clean_text(doc):
    """
    This function is used to clean the text data. It performs several operations to preprocess the text data.
    
    Parameters:
    doc (str): The text document that needs to be cleaned.

    Returns:
    doc (str): The cleaned text document.

    The steps involved in the cleaning process are:
    1. Convert all characters to lowercase.
    2. Replace all punctuation with a space.
    3. Split the text into tokens (words) using white space as a delimiter.
    4. Remove tokens that are not alphabetic.
    5. Filter out English stop words.
    6. Filter out short tokens (length <= 1).
    7. Join the tokens back into a single string with spaces in between.
    """

    # Convert all characters to lowercase
    doc = doc.lower()

    # Replace all punctuation with a space
    for char in string.punctuation:
        doc = doc.replace(char, ' ')

    # Split the text into tokens (words) using white space as a delimiter
    tokens = doc.split()

    # Remove tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]

    # Filter out English stop words
    # (requires a one-time nltk.download('stopwords'))
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Filter out short tokens (length <= 1)
    tokens = [word for word in tokens if len(word) > 1]

    # Join the tokens back into a single string with spaces in between
    doc = " ".join(tokens)

    return doc
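
To illustrate the steps above, here is a minimal, self-contained sketch that runs the same pipeline on a made-up tweet. To avoid depending on the NLTK corpus download, it uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`); the name `clean_text_demo` and the sample text are also assumptions and are not part of the notebook.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"a", "at", "in", "is", "of", "the"}

# Translation table that maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_demo(doc, stop_words=DEMO_STOP_WORDS):
    """Apply the same seven cleaning steps as clean_text to one string."""
    doc = doc.lower()                      # 1. lowercase
    doc = doc.translate(PUNCT_TO_SPACE)    # 2. punctuation -> space
    tokens = doc.split()                   # 3. split on whitespace
    tokens = [
        word for word in tokens
        if word.isalpha()                  # 4. keep alphabetic tokens only
        and word not in stop_words         # 5. drop stop words
        and len(word) > 1                  # 6. drop single-character tokens
    ]
    return " ".join(tokens)                # 7. join back into one string


# Made-up sample text, not taken from the dataset
sample = "Wildfire spreads near the highway! 3 roads closed: https://example.com/x1"
cleaned = clean_text_demo(sample)
print(cleaned)
# wildfire spreads near highway roads closed https example com
```

Note that the URL is not removed as a unit: replacing punctuation with spaces splits it into pieces, and the alphabetic pieces survive as ordinary tokens.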

In [24]:
def clean_df(df):
    """
    This function is used to clean the text data in a DataFrame. It applies the clean_text function to each text in the DataFrame.

    Parameters:
    df (DataFrame): The DataFrame that contains the text data that needs to be cleaned.

    Returns:
    cleaned_df (list): A list of cleaned text data.

    The steps involved in the cleaning process are:
    1. Initialize an empty list, cleaned_df.
    2. Iterate over each text in the 'text' column of the DataFrame.
    3. Apply the clean_text function to each text.
    4. Append the cleaned text to the cleaned_df list.
    5. Return the cleaned_df list.
    """

    # Initialize an empty list, cleaned_df
    cleaned_df = []

    # Iterate over each text in the 'text' column of the DataFrame
    for text in tqdm(df['text']):
        # Apply the clean_text function to each text
        clean = clean_text(text)

        # Append the cleaned text to the cleaned_df list
        cleaned_df.append(clean)

    # Return the cleaned_df list
    return cleaned_df

In [25]:
# Copy the DataFrame so the raw text in df is left untouched
cleaned_df = df.copy()
cleaned_df['text'] = cleaned_df['text'].apply(clean_text)

In [26]:
cleaned_df.head()

Unnamed: 0,text,target
0,communal violence bhainsa telangana stones pel...,1
1,telangana section imposed bhainsa january clas...,1
2,arsonist sets cars ablaze dealership https co ...,1
3,arsonist sets cars ablaze dealership https co ...,1
4,lord jesus love brings freedom pardon fill hol...,0


In [27]:
# Split the data into training (80%) and test (20%) sets
train_x, test_x, train_y, test_y = train_test_split(cleaned_df['text'], cleaned_df['target'], test_size=0.2)
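
As a rough illustration of what a shuffled 80/20 split does, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the function `split_sketch`, the `demo_` variables, and the toy data are hypothetical. The `seed` argument plays the role that `random_state` plays in `train_test_split`; because the call above does not set `random_state`, the notebook's split (and any accuracy computed from it) will differ from run to run.

```python
import math
import random


def split_sketch(items, labels, test_size=0.2, seed=0):
    """Shuffle the indices once, then cut off the first test_size fraction as the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Toy data standing in for the cleaned tweets and their targets
demo_texts = [f"tweet {i}" for i in range(10)]
demo_targets = [i % 2 for i in range(10)]

demo_train_x, demo_test_x, demo_train_y, demo_test_y = split_sketch(demo_texts, demo_targets)
print(len(demo_train_x), len(demo_test_x))
# 8 2
```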

### Tweet classification with tf-idf

We will use tf-idf to convert the text into numerical feature vectors, then train a simple multilayer perceptron (MLP) to classify the tweets.

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [29]:
# The vectorizer performs the tf-idf calculations for you
vectorizer = TfidfVectorizer(use_idf=True, max_features=900)
tf_idf_train_text = vectorizer.fit_transform(train_x).toarray()

# Only call transform (not fit_transform) on the test data;
# fitting the vectorizer on the test data would leak information.
tf_idf_test_text = vectorizer.transform(test_x).toarray()
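
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of tf-idf. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is what `TfidfVectorizer` uses with its default settings. The functions `fit_tf_idf` and `transform_tf_idf` and the toy documents are hypothetical. The sketch splits on whitespace and does not model the `max_features` cap, so it is not a drop-in replacement for the vectorizer.

```python
import math
from collections import Counter


def fit_tf_idf(train_docs):
    """Learn the vocabulary and idf weights from the training documents only."""
    tokenized = [doc.split() for doc in train_docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}
    return vocab, idf


def transform_tf_idf(docs, vocab, idf):
    """Weight term counts by the already-fitted idf and L2-normalize each row."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Terms outside the fitted vocabulary are simply ignored
        row = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm > 0 else row)
    return rows


train_docs = ["fire downtown fire", "flood downtown", "love song"]
test_docs = ["fire tornado"]   # "tornado" never appears in the training data

vocab, idf = fit_tf_idf(train_docs)                      # fit on training data only
train_matrix = transform_tf_idf(train_docs, vocab, idf)
test_matrix = transform_tf_idf(test_docs, vocab, idf)    # reuse the fitted vocabulary and idf

print(vocab)
# ['downtown', 'fire', 'flood', 'love', 'song']
```

Because the test documents never influence `vocab` or `idf`, nothing about the test set reaches the features the model is trained on.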

The model has a single hidden layer of 16 ReLU units, followed by one sigmoid output unit that gives the probability of the positive class (`target` = 1).

In [33]:
# Build the model
tf_idf_model = Sequential()
tf_idf_model.add(Dense(16, activation='relu', input_shape=(tf_idf_train_text.shape[1],)))
tf_idf_model.add(Dense(1, activation='sigmoid'))

# Compile the model
tf_idf_model.compile(loss='binary_crossentropy', optimizer=optimizers.Adam(learning_rate=0.0001), metrics=['accuracy'])

# Train the model (no epochs or batch_size given, so Keras defaults apply: 1 epoch, batch size 32)
history = tf_idf_model.fit(tf_idf_train_text, train_y)
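
To show what this network computes for a single tweet, here is a minimal pure-Python sketch of the forward pass and the binary cross-entropy loss. The weights are random placeholders, not the trained values, and the input vector is made up. The parameter count assumes the vocabulary reaches the `max_features=900` cap, so the input has 900 features.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(inputs @ weights + biases)."""
    outputs = []
    for j in range(len(biases)):
        z = biases[j] + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outputs.append(activation(z))
    return outputs


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Loss for one example; probabilities are clipped to avoid log(0)."""
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


rng = random.Random(0)
n_features, n_hidden = 900, 16

# Placeholder weights (hypothetical, untrained)
w1 = [[rng.uniform(-0.05, 0.05) for _ in range(n_hidden)] for _ in range(n_features)]
b1 = [0.0] * n_hidden
w2 = [[rng.uniform(-0.05, 0.05)] for _ in range(n_hidden)]
b2 = [0.0]

# Made-up sparse tf-idf vector with two non-zero terms (L2 norm 1)
x = [0.0] * n_features
x[3], x[42] = 0.6, 0.8

hidden = dense(x, w1, b1, relu)           # Dense(16, activation='relu')
prob = dense(hidden, w2, b2, sigmoid)[0]  # Dense(1, activation='sigmoid')
loss = binary_crossentropy(1, prob)

# Weights plus biases of both layers
n_params = (n_features * n_hidden + n_hidden) + (n_hidden * 1 + 1)
print(n_params)
# 14433
```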





In [34]:
loss, accuracy = tf_idf_model.evaluate(tf_idf_test_text, test_y)
print(f'Test Accuracy: {accuracy*100:.2f}%')

Test Accuracy: 81.00%
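
For reference, here is a minimal sketch of how an accuracy figure like the one above can be derived from sigmoid outputs. It assumes that, for a single sigmoid output with a binary cross-entropy loss, Keras resolves `metrics=['accuracy']` to binary accuracy with a 0.5 threshold. The function and the `demo_` values are hypothetical and unrelated to the notebook's actual predictions.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == truth) for pred, truth in zip(predictions, y_true))
    return correct / len(y_true)


# Made-up labels and predicted probabilities
demo_targets = [1, 0, 1, 1, 0]
demo_probs = [0.91, 0.20, 0.35, 0.77, 0.64]

demo_accuracy = binary_accuracy(demo_targets, demo_probs)
print(f'Test Accuracy: {demo_accuracy*100:.2f}%')
# Test Accuracy: 60.00%
```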
