# Exploratory Data Analysis
* Geographical map of observations
* Extract information from vectorizers (TF-IDF and Count)
* group_by: keyword

In [5]:
import numpy as np
import pandas as pd
from joblib import dump, load

We first load the saved objects: both vectorizers and their associated training matrices.

In [7]:
# Load in TF-IDF and Count Vectorizers and training datasets
X_train_tfidf = load("X_train_tfidf.joblib")
X_train_count = load("X_train_count.joblib")
tfidf_vectorizer = load("tfidf_vectorizer.joblib")
count_vectorizer = load("count_vectorizer.joblib")

Next, we extract the vocabulary (the feature names) from each vectorizer.

In [13]:
# Get feature names (words)
count_features = count_vectorizer.get_feature_names_out()
tfidf_features = tfidf_vectorizer.get_feature_names_out()

print("Sample CountVectorizer features:", count_features)
print("Sample TF-IDF features:", tfidf_features)

Sample CountVectorizer features: ['00' '000' '0000' ... 'ûónegligence' 'ûótech' 'ûówe']
Sample TF-IDF features: ['00' '000' '0000' ... 'ûónegligence' 'ûótech' 'ûówe']


As you can see, many of the features are noisy: numeric tokens, IDs, hashes, and mis-encoded strings such as 'ûótech'. We will now extract the most frequent words from the CountVectorizer matrix and the most heavily weighted words from the TF-IDF matrix.
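One possible way to screen out noisy tokens like these is a regular-expression mask over the feature names. The sketch below is only an illustration: the sample vocabulary is made up to resemble the output above, and the rule "keep tokens made only of ASCII letters" is an assumption, not part of the original pipeline.

```python
import re
import numpy as np

# Hypothetical sample of vocabulary entries, resembling the printed features
sample_features = np.array(['00', '000', '0000', 'emergency', 'fire', 'ûótech'])

# Assumed definition of a "clean" token: lowercase ASCII letters only
clean_pattern = re.compile(r'^[a-z]+$')

# Boolean mask marking which tokens pass the filter
keep_mask = np.array([bool(clean_pattern.match(tok)) for tok in sample_features])

clean_features = sample_features[keep_mask]
print(clean_features)
```

If this rule suits the data, the column positions given by `np.flatnonzero(keep_mask)` could also be used to select the matching columns of the count or TF-IDF matrix.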

In [17]:
# Sum word occurrences
word_counts = np.asarray(X_train_count.sum(axis=0)).flatten()

# Create df with words and their frequencies
word_freq_df = pd.DataFrame({'word': count_features, 'count': word_counts})

# Sort by frequency
word_freq_df = word_freq_df.sort_values(by='count', ascending=False)

print("Top 10 most frequent words: ", word_freq_df[:10])

Top 10 most frequent words:              word  count
10005       http   4289
13492    missing   2514
14212        new    438
10006      https    410
12322       like    350
23111         û_    346
2138         amp    344
11325       just    319
21200        usa    281
7134   emergency    262
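The counts above are total occurrences across all tweets, so a word repeated many times in a few tweets can rank as high as a word that appears once in many tweets. The toy sketch below (made-up values, not the real training matrix) shows how the column sum works on a sparse matrix and how a per-word document frequency could be computed alongside it.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy document-term count matrix: 3 documents x 3 terms (made-up values)
X_toy = csr_matrix(np.array([[2, 0, 1],
                             [1, 0, 0],
                             [3, 1, 0]]))
toy_features = np.array(['http', 'fire', 'new'])

# Summing a csr_matrix over axis=0 gives a 1 x n_terms result,
# so it is converted to a flat 1-D array before building the DataFrame
total_counts = np.asarray(X_toy.sum(axis=0)).ravel()

# Document frequency: number of documents containing each term at least once
doc_freq = np.asarray((X_toy > 0).sum(axis=0)).ravel()

toy_df = pd.DataFrame({'word': toy_features,
                       'count': total_counts,
                       'doc_freq': doc_freq})
print(toy_df.sort_values(by='count', ascending=False))
```

In this toy data, 'http' occurs 6 times in total but in only 3 documents, which is the kind of gap the extra column is meant to expose.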


In [19]:
# Sum TF-IDF scores
tfidf_weights = np.asarray(X_train_tfidf.sum(axis=0)).flatten()

# Create df with words and their TF-IDF scores
tfidf_df = pd.DataFrame({'word': tfidf_features, 'tfidf': tfidf_weights})

# Sort by importance
tfidf_df = tfidf_df.sort_values(by='tfidf', ascending=False)

print("Top 10 most important words: ", tfidf_df[:10])

Top 10 most important words:              word       tfidf
10005       http  272.064231
13492    missing  219.534227
10006      https   61.679557
14212        new   60.700329
12322       like   56.467610
11325       just   51.949189
2138         amp   49.022791
23111         û_   46.001873
21200        usa   44.677352
7134   emergency   44.382660
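The two top-10 lists look nearly identical. A quick check of how much they overlap can be sketched as below, with the words copied by hand from the two printed outputs; in the notebook itself the lists could instead be taken from the first ten rows of `word_freq_df` and `tfidf_df`.

```python
# Top-10 words copied from the two printed outputs above
top_count_words = ['http', 'missing', 'new', 'https', 'like',
                   'û_', 'amp', 'just', 'usa', 'emergency']
top_tfidf_words = ['http', 'missing', 'https', 'new', 'like',
                   'just', 'amp', 'û_', 'usa', 'emergency']

# Words that appear in both lists, regardless of rank
shared = set(top_count_words) & set(top_tfidf_words)

# Words that hold exactly the same rank in both lists
same_position = [a for a, b in zip(top_count_words, top_tfidf_words) if a == b]

print(f"Shared words: {len(shared)} of 10")
print(f"Same rank in both lists: {len(same_position)} of 10")
```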


Next, we compare how heavily each word is weighted in real versus fake disaster tweets.

In [29]:
# Load in clean train and test sets
clean_train = pd.read_pickle("clean_train.pkl")
clean_test = pd.read_pickle("clean_test.pkl")

# Separate real and fake tweets
real_tweets = clean_train[clean_train['target'] == 1]['text']
fake_tweets = clean_train[clean_train['target'] == 0]['text']

# Vectorize separately
X_real_tfidf = tfidf_vectorizer.transform(real_tweets)
X_fake_tfidf = tfidf_vectorizer.transform(fake_tweets)

# Compute sum of TF-IDF scores per word
real_tfidf_scores = np.asarray(X_real_tfidf.sum(axis=0)).flatten()
fake_tfidf_scores = np.asarray(X_fake_tfidf.sum(axis=0)).flatten()

# Create DataFrame
word_comparison_df = pd.DataFrame({
    'word': tfidf_features,
    'real_tfidf': real_tfidf_scores,
    'fake_tfidf': fake_tfidf_scores
})

# Compute difference
word_comparison_df['difference'] = word_comparison_df['real_tfidf'] - word_comparison_df['fake_tfidf']

# Sort by words with greatest difference
word_comparison_df = word_comparison_df.sort_values(by='difference', ascending=False)

print(word_comparison_df)


             word  real_tfidf  fake_tfidf  difference
10005        http  173.835672  139.441639   34.394033
9760    hiroshima   22.452890    0.198542   22.254349
4125   california   23.239443    0.994222   22.245221
19452     suicide   23.459749    2.185977   21.273772
13332       mh370   20.049027    0.000000   20.049027
...           ...         ...         ...         ...
6519          don   10.886780   31.540566  -20.653786
3559         body    3.359401   25.313871  -21.954470
10006       https   23.160087   51.086612  -27.926525
12322        like   18.330455   49.323109  -30.992654
11325        just   15.440651   46.908047  -31.467396

[23168 rows x 4 columns]


Here is how to interpret the results above:
* Large positive difference: the word carries more total TF-IDF weight in real disaster tweets
* Large negative difference: the word carries more total TF-IDF weight in fake disaster tweets
* Difference close to 0: the word carries similar total weight in both classes, or is rare in both
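Because the scores above are sums, a class with more tweets will tend to accumulate larger totals. If the two classes differ in size, comparing the mean score per tweet may be fairer. The sketch below uses small made-up dense matrices (not the real data) to show how the sum-based and mean-based differences can disagree.

```python
import numpy as np
import pandas as pd

# Toy TF-IDF matrices (rows = tweets, columns = words); all values are made up
toy_words = np.array(['fire', 'like', 'http'])
X_real_toy = np.array([[0.8, 0.0, 0.3],
                       [0.6, 0.1, 0.4]])   # 2 "real" tweets
X_fake_toy = np.array([[0.0, 0.5, 0.3],
                       [0.1, 0.6, 0.2],
                       [0.0, 0.4, 0.4],
                       [0.1, 0.5, 0.3]])   # 4 "fake" tweets

toy_comparison = pd.DataFrame({
    'word': toy_words,
    # Sum-based difference, as computed in the cell above
    'sum_difference': X_real_toy.sum(axis=0) - X_fake_toy.sum(axis=0),
    # Mean-based difference, which divides out the number of tweets per class
    'mean_difference': X_real_toy.mean(axis=0) - X_fake_toy.mean(axis=0),
})

toy_comparison = toy_comparison.sort_values(by='mean_difference', ascending=False)
print(toy_comparison)
```

In this toy example 'http' has a negative sum-based difference only because there are twice as many fake tweets; its mean-based difference is slightly positive. With the notebook's own variables, the equivalent would be dividing `real_tfidf_scores` and `fake_tfidf_scores` by `X_real_tfidf.shape[0]` and `X_fake_tfidf.shape[0]` before subtracting.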