# NLP Data Processing

This notebook performs basic NLP pre-processing on the tweet text: it uses a simple CountVectorizer to count selected words in each tweet and adds those counts as features to the exported CSV file that will be used for model training.

In [1]:
import pandas as pd
import nltk
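from nltk import word_tokenize
from nltk.corpus import stopwords
# Uncomment on first run to fetch the tokenizer models and the stopword list
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
#nltk.download('punkt')
#nltk.download('stopwords')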

In [2]:
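# A set gives fast membership tests; note that this rebinds the imported name to the word collection.
stopwords = set(stopwords.words('english'))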

In [3]:
data = pd.read_csv('clean_tweet_data.csv',index_col=0,parse_dates=['created_at'])

In [4]:
data.head()

Unnamed: 0,screen_name,is_quote_status,text,created_at,favorites,retweets,follower_count,is_truncated,statuses_count,Day_of_Week
0,awonderland,False,Would you trust me with your drink be honest h...,2021-02-01 00:33:12,3917,198,326241,False,44948,0
1,awonderland,False,some inspiration for you https://t.co/ovatFDkZxZ,2021-01-31 20:14:43,4815,516,326241,False,44948,6
2,awonderland,False,if you see this tweet I am chilling &amp; so s...,2021-01-31 07:37:05,5028,223,326241,False,44948,6
3,awonderland,False,i want to marry my bed,2021-01-31 02:47:53,4329,927,326241,False,44948,6
4,awonderland,False,SOPHIE was groundbreaking and inspiring in so ...,2021-01-30 19:56:59,2323,145,326241,False,44948,5


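Many of the tweets contain a link (a string starting with http) that references either an image or a video. I will replace each of these with a custom containsmedia token so that it can serve as another feature for the model. Note that the pattern http.* matches from the start of the link to the end of the line, so any text that follows the link on that line is replaced as well. The same cell also extracts the hour of posting into a new time_of_day column.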

In [5]:
data['text'] = data.text.str.replace('http.*','containsmedia',regex=True)
data['time_of_day'] = data['created_at'].dt.hour
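
To make the behaviour of the link pattern concrete, here is a minimal sketch using only the standard library. The sample tweet and its URL are made up, and the `http\S+` pattern is a hypothetical alternative rather than what this notebook uses.

```python
import re

# Made-up tweet with text on both sides of a link.
tweet = "new single out now https://t.co/abc123 go listen"

# Pattern used in this notebook: greedy, runs to the end of the line.
greedy = re.sub(r"http.*", "containsmedia", tweet)

# Hypothetical alternative: stop at the first whitespace after the link.
url_only = re.sub(r"http\S+", "containsmedia", tweet)

print(greedy)    # new single out now containsmedia
print(url_only)  # new single out now containsmedia go listen
```

Whether the trailing words matter depends on how often tweets in the data continue after a link, which is not checked here.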

In [6]:
data.head()

Unnamed: 0,screen_name,is_quote_status,text,created_at,favorites,retweets,follower_count,is_truncated,statuses_count,Day_of_Week,time_of_day
0,awonderland,False,Would you trust me with your drink be honest c...,2021-02-01 00:33:12,3917,198,326241,False,44948,0,0
1,awonderland,False,some inspiration for you containsmedia,2021-01-31 20:14:43,4815,516,326241,False,44948,6,20
2,awonderland,False,if you see this tweet I am chilling &amp; so s...,2021-01-31 07:37:05,5028,223,326241,False,44948,6,7
3,awonderland,False,i want to marry my bed,2021-01-31 02:47:53,4329,927,326241,False,44948,6,2
4,awonderland,False,SOPHIE was groundbreaking and inspiring in so ...,2021-01-30 19:56:59,2323,145,326241,False,44948,5,19


Create a DataFrame with one row per word occurrence, carrying the favorites and retweets of the tweet the word came from:

In [7]:
rows = []

for i in range(len(data)):
    faves = data.favorites[i]
    rts = data.retweets[i]
    for word in word_tokenize(data.text[i]):
        # The stopword check runs before lowercasing, so capitalised
        # stopwords such as "I" pass through here and are filtered later.
        if word not in stopwords and word.isalpha():
            rows.append({'Word': word.lower(), 'Favorites': faves, 'Retweets': rts})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
word_list = pd.DataFrame(rows, columns=['Word', 'Favorites', 'Retweets'])
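
The same one-row-per-word layout can also be produced without an explicit loop. The sketch below is an illustration on made-up toy data, and it assumes plain whitespace splitting as a stand-in for nltk's word_tokenize, so it is not a drop-in replacement for the cell above.

```python
import pandas as pd

# Toy data standing in for the tweet table (values are made up).
toy = pd.DataFrame({
    "text": ["new music today", "love love music"],
    "favorites": [10, 4],
    "retweets": [2, 1],
})

# str.split is a stand-in tokenizer; the notebook uses nltk's word_tokenize.
toy["Word"] = toy["text"].str.lower().str.split()

# explode() repeats each tweet's favorites/retweets once per word in its list.
word_rows = (
    toy.explode("Word")[["Word", "favorites", "retweets"]]
       .rename(columns={"favorites": "Favorites", "retweets": "Retweets"})
       .reset_index(drop=True)
)
print(word_rows)
```

A word that appears twice in one tweet yields two rows, so the later counts are counts of word occurrences rather than of distinct tweets.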

In [8]:
tweet_words = word_list.groupby(['Word']).sum().sort_values(by='Favorites',ascending=False)

In [9]:
word_counts = word_list.groupby(['Word']).count().sort_values(by='Favorites',ascending=False)

In [10]:
final_word_data = tweet_words.join(word_counts,on='Word',rsuffix=' tweet count')
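
The sum, count and join steps above could also be expressed as a single grouped aggregation. This is a sketch on made-up rows; the column name `Uses` is hypothetical and does not appear elsewhere in this notebook.

```python
import pandas as pd

# Made-up word rows in the same shape as word_list.
word_rows = pd.DataFrame({
    "Word": ["new", "music", "today", "love", "love", "music"],
    "Favorites": [10, 10, 10, 4, 4, 4],
    "Retweets": [2, 2, 2, 1, 1, 1],
})

# Named aggregation: totals and the number of occurrences in one pass.
summary = (
    word_rows.groupby("Word")
    .agg(
        Favorites=("Favorites", "sum"),
        Retweets=("Retweets", "sum"),
        Uses=("Favorites", "count"),
    )
    .sort_values("Favorites", ascending=False)
)
print(summary)
```

This yields one count column instead of two identical ones, at the cost of different column names from the rest of the notebook.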

Looking at some of the most commonly used words and the performance of the tweets that contain them:

In [11]:
final_word_data

Word,Favorites,Retweets,Favorites tweet count,Retweets tweet count
containsmedia,5940780,1032307,9147,9147
i,2307064,313624,1250,1250
u,833590,130998,813,813
music,650514,128796,420,420
new,603134,94813,769,769
...,...,...,...,...
yoyoyoyooooo,0,0,1,1
louiswtmarais,0,0,1,1
cheezmagic,0,0,2,2
kloeystrande,0,0,1,1


In [12]:
final_word_data['Rts/Tweet'] = final_word_data.Retweets/final_word_data['Retweets tweet count']
final_word_data['Faves/Tweet'] = final_word_data.Favorites/final_word_data['Favorites tweet count']

In [13]:
final_word_data[final_word_data['Favorites tweet count'] > 100].sort_values('Faves/Tweet',ascending = False).head(20)

Word,Favorites,Retweets,Favorites tweet count,Retweets tweet count,Rts/Tweet,Faves/Tweet
virtual,251596,47765,102,102,468.284314,2466.627451
want,256301,45603,111,111,410.837838,2309.018018
miss,227304,45856,106,106,432.603774,2144.377358
i,2307064,313624,1250,1250,250.8992,1845.6512
going,217574,38521,125,125,308.168,1740.592
play,173325,20428,107,107,190.915888,1619.859813
music,650514,128796,420,420,306.657143,1548.842857
much,299318,41454,200,200,207.27,1496.59
people,250804,35475,170,170,208.676471,1475.317647
life,162280,24097,113,113,213.247788,1436.106195


In [14]:
final_word_data.index.values

array(['containsmedia', 'i', 'u', ..., 'cheezmagic', 'kloeystrande',
       'jxstinxluke'], dtype=object)

Total number of words:

In [15]:
len(final_word_data.index.values)

10160

I will need to reduce the number of features, so I will only keep words used more than 100 times, which gives a much more manageable 115 candidate words.

In [16]:
len(final_word_data[final_word_data['Favorites tweet count']>100].index.values)

115

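Creating a DataFrame of per-tweet counts using only the words used more than 100 times. Stopwords are excluded again here because the earlier check ran before lowercasing, so words such as i are still present in the word list: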

In [17]:
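from sklearn.feature_extraction.text import CountVectorizer

# Keep words used more than 100 times, then drop stopwords that slipped
# through the earlier case-sensitive check (e.g. 'i').
vocab = final_word_data[final_word_data['Favorites tweet count']>100].index.values
vocab = [word for word in vocab if word not in stopwords]

# Note: the default token_pattern ignores one-character tokens, so columns
# such as 'u', 'x' and 'w' stay at zero unless token_pattern is changed.
vectorizer = CountVectorizer(vocabulary = vocab)

# The vocabulary is fixed, so each tweet can be transformed directly
# into a 1 x len(vocab) sparse row of counts.
X = [vectorizer.transform([text]) for text in data.text]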

In [18]:
X_arrays = []
for i in range(0,len(X)):
    X_arrays.append(X[i].toarray())

In [19]:
import numpy as np

feature_words = pd.DataFrame(np.concatenate(X_arrays),columns=vocab)
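
CountVectorizer's default token_pattern keeps only tokens of two or more characters, which matters here because the vocabulary contains one-letter words such as u, x and w. The sketch below uses made-up texts and a made-up four-word vocabulary, transforms all texts in one call, and compares the default with an assumed alternative pattern that accepts single characters.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts and vocabulary for illustration only.
texts = ["u know i love new music", "New music today"]
vocab = ["u", "music", "new", "love"]

# Default settings: text is lowercased and one-character tokens are dropped.
default_vec = CountVectorizer(vocabulary=vocab)
default_counts = pd.DataFrame(default_vec.transform(texts).toarray(), columns=vocab)

# Assumed alternative: a token pattern that also accepts one-character tokens.
short_vec = CountVectorizer(vocabulary=vocab, token_pattern=r"(?u)\b\w+\b")
short_counts = pd.DataFrame(short_vec.transform(texts).toarray(), columns=vocab)

print(default_counts)
print(short_counts)
```

With the default pattern the `u` column is zero for every text, while the alternative pattern counts it. Passing the whole list to `transform` also avoids the per-tweet loop and the later concatenation.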

Inspecting the new feature words DataFrame:

In [20]:
feature_words.head()

Unnamed: 0,containsmedia,u,music,new,love,today,see,first,thank,get,...,sold,ca,excited,x,thanks,pm,w,tix,check,party
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


Join the feature words DataFrame to the original data on the index and export the result as a CSV for further pre-processing in the next notebook:

In [21]:
data_feat = data.join(feature_words,rsuffix = 'feat')
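
Because join aligns rows on index labels, the step above relies on data and feature_words sharing the same labels (0, 1, 2, ...). The following sketch, on made-up frames, shows what happens when they do not, and one possible way to line rows up by position; whether this situation arises in the real data is an assumption, not something verified here.

```python
import pandas as pd

# Made-up frames: the left index does not start at 0.
left = pd.DataFrame({"text": ["a", "b", "c"]}, index=[10, 11, 12])
right = pd.DataFrame({"music": [1, 0, 2]})  # default index 0, 1, 2

# join() matches on index labels, so no labels match and the column is all NaN.
misaligned = left.join(right)

# Copying the left index onto the feature frame lines rows up by position.
right_aligned = right.copy()
right_aligned.index = left.index
aligned = left.join(right_aligned)

print(misaligned)
print(aligned)
```

A quick check such as comparing the two indexes before joining would catch this case early.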

In [22]:
data_feat.head()

Unnamed: 0,screen_name,is_quote_status,text,created_at,favorites,retweets,follower_count,is_truncated,statuses_count,Day_of_Week,...,sold,ca,excited,x,thanks,pm,w,tix,check,party
0,awonderland,False,Would you trust me with your drink be honest c...,2021-02-01 00:33:12,3917,198,326241,False,44948,0,...,0,0,0,0,0,0,0,0,0,0
1,awonderland,False,some inspiration for you containsmedia,2021-01-31 20:14:43,4815,516,326241,False,44948,6,...,0,0,0,0,0,0,0,0,0,0
2,awonderland,False,if you see this tweet I am chilling &amp; so s...,2021-01-31 07:37:05,5028,223,326241,False,44948,6,...,0,0,0,0,0,0,0,0,0,0
3,awonderland,False,i want to marry my bed,2021-01-31 02:47:53,4329,927,326241,False,44948,6,...,0,0,0,0,0,0,0,0,0,0
4,awonderland,False,SOPHIE was groundbreaking and inspiring in so ...,2021-01-30 19:56:59,2323,145,326241,False,44948,5,...,0,0,0,0,0,0,0,0,0,0


In [24]:
data_feat.to_csv('./data_wordvectorized.csv')