# Data Cleaning and Preprocessing

In the Data Wrangling notebook I noted that I would need to clean and prepare the data for analysis, but I didn't realize how many steps that would take. I ended up spending that entire notebook wrangling the data into the format I needed and pulling out the specific fields I wanted while ignoring the irrelevant ones.

Now that the data is in that format, I can take the time to clean it up and get it ready for analysis.

In [164]:
import pandas as pd

In [165]:
df = pd.read_pickle("C:/Users/jzpow/Code/Projects/Naomi-Serena/data/naomi-serena-tweets.pkl")

In [166]:
df.head()

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
5,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
6,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
7,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka


In [167]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23828 entries, 0 to 14998
Data columns (total 4 columns):
id              23828 non-null int64
tweet_date      23828 non-null object
tweet_text      23828 non-null object
search query    23828 non-null object
dtypes: int64(1), object(3)
memory usage: 930.8+ KB


In [168]:
df = df.reset_index(drop = True)

In [169]:
df['tweet_text'][:100]

0     Naomi Osaka upsets Serena Williams in controve...
1     @ Naomi_Osaka_ , you go girl! I got your back!...
2     @ Naomi_Osaka_ probably felt like she was at h...
3     Congrats girly, don’t let anyone take this mom...
4     Naomi Osaka defeats Serena Williams in a drama...
5     https://twitter.com/juventino5555/status/10377...
6     Carlos Ramos also robbed Osaka. Imagine how mu...
7     Yes Bravo to @ BigSascha Bajin And of course l...
8     Tennis officials.. where coaches are seen coac...
9     Naomi Osaka tops Serena Williams in U.S. Open ...
10    Booing damn Naomi Osaka won the girl was cooki...
11    You should take the with you @ Naomi_Osaka_ Co...
12    @ Naomi_Osaka_ Congratulaions. . . who were te...
13    The proof that something can be done is when y...
14    Her ambition, she once told a reporter, was “t...
15    [FULL] 2018 US Open trophy ceremony with Seren...
16    What's happening? Part 1. Naomi Osaka vs Seren...
17    @ Naomi_Osaka_ Congratulations You are pro

So from examining a few rows of the tweet data, I can see a few things that can be cleaned up:

* links
* hashtags (e.g. #3StripeLife)
* mentions (e.g. @ Naomi_Osaka_ or @ serenawilliams)
* misspellings (e.g. "awaful")
* headlines (e.g. [FULL] 2018 US Open Trophy...)
* slang (e.g. She got screwed by the ump)

I'm not sure I can clean all of these up, but I can at least do a little bit. So let's get started!

*The preprocessing done in this notebook is based on the following tutorial: https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html*

In [170]:
test = df[:10].copy()
test

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
2,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
3,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
4,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka
5,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...,naomi osaka
6,10,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,naomi osaka
7,11,Sat Sep 08 19:59:52 +0000 2018,Yes Bravo to @ BigSascha Bajin And of course l...,naomi osaka
8,12,Sat Sep 08 19:59:50 +0000 2018,Tennis officials.. where coaches are seen coac...,naomi osaka
9,13,Sat Sep 08 19:59:50 +0000 2018,Naomi Osaka tops Serena Williams in U.S. Open ...,naomi osaka


## Step 0: Noise Removal

This step is what the tutorial calls "noise removal." Noise removal is task-specific, meaning we need to take our own data into consideration when deciding what counts as noise.

For this tweet data, it looks like we'll want to remove hashtags, links and mentions.

In [171]:
for tweet in test['tweet_text']:
    print(tweet)

Naomi Osaka upsets Serena Williams in controversial US Open final - CNN # SmartNewshttps://edition.cnn.com/2018/09/08/sport/naomi-osaka-serena-williams-us-open-tennis-int-spt/index.html …
@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!
@ Naomi_Osaka_ probably felt like she was at her friend’s house when their mom started yelling at them # usopen
Congrats girly, don’t let anyone take this moment from you..you outplayed everyone, even the GOAT Serena @ Naomi_Osaka_
Naomi Osaka defeats Serena Williams in a dramatic US Open final https://twitter.com/i/events/1038540032330493952 …
https://twitter.com/juventino5555/status/1037768949109276672?s=19 …
Carlos Ramos also robbed Osaka. Imagine how much better she would feel if she broke Serena to go up 5-3 instead of being giving the game.
Yes Bravo to @ BigSascha Bajin And of course love as always (since she turned pro!) to @ Naomi_Osaka_ https://twitter.com/bgtennisnation/status/1038563742961881090 …
Naomi Osaka tops Sere

### 0.0: Remove Hyperlinks

In [172]:
import re

In [173]:
# Raw string so the regex escape \s is passed through to `re` unchanged
url_pattern = r"http[^\s]+\s?…?"

In [174]:
# for tweet in test.loc[test['tweet_text']]:
#     if re.search(url_pattern, tweet) is not None:
#         tweet = re.sub(url_pattern, ' ', tweet)
#     else:
#         pass

In [175]:
# test.loc[test['tweet_text'].str.contains("http")] = "TEST"

In [176]:
test.loc[:, 'tweet_text'].replace(url_pattern, " ", regex=True, inplace=True)

In [177]:
test

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
2,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
3,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
4,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka
5,9,Sat Sep 08 19:59:55 +0000 2018,,naomi osaka
6,10,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,naomi osaka
7,11,Sat Sep 08 19:59:52 +0000 2018,Yes Bravo to @ BigSascha Bajin And of course l...,naomi osaka
8,12,Sat Sep 08 19:59:50 +0000 2018,Tennis officials.. where coaches are seen coac...,naomi osaka
9,13,Sat Sep 08 19:59:50 +0000 2018,Naomi Osaka tops Serena Williams in U.S. Open ...,naomi osaka


In [178]:
test.iloc[5]

id                                           9
tweet_date      Sat Sep 08 19:59:55 +0000 2018
tweet_text                                    
search query                       naomi osaka
Name: 5, dtype: object

In [179]:
for tweet in test['tweet_text']:
    print(tweet)

Naomi Osaka upsets Serena Williams in controversial US Open final - CNN # SmartNews 
@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!
@ Naomi_Osaka_ probably felt like she was at her friend’s house when their mom started yelling at them # usopen
Congrats girly, don’t let anyone take this moment from you..you outplayed everyone, even the GOAT Serena @ Naomi_Osaka_
Naomi Osaka defeats Serena Williams in a dramatic US Open final  
 
Carlos Ramos also robbed Osaka. Imagine how much better she would feel if she broke Serena to go up 5-3 instead of being giving the game.
Yes Bravo to @ BigSascha Bajin And of course love as always (since she turned pro!) to @ Naomi_Osaka_  
Naomi Osaka tops Serena Williams in U.S. Open final, becomes first Japanese grand slam singles champion | The Japan Times  
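
The URL pattern can also be checked on plain strings before touching the DataFrame. This is a minimal sketch using only the standard library; the helper name `strip_urls` is my own and not part of the notebook, and the sample strings are two of the tweets printed above.

```python
import re

# "http", then a run of non-whitespace, then an optional space and optional "…".
URL_PATTERN = re.compile(r"http\S+\s?…?")

def strip_urls(text):
    """Replace every URL (plus a trailing ellipsis, if present) with one space."""
    return URL_PATTERN.sub(" ", text)

with_link = (
    "Naomi Osaka defeats Serena Williams in a dramatic US Open final "
    "https://twitter.com/i/events/1038540032330493952 …"
)
only_link = "https://twitter.com/juventino5555/status/1037768949109276672?s=19 …"

cleaned = strip_urls(with_link)
link_only_result = strip_urls(only_link)

print(repr(cleaned))
print(repr(link_only_result))
```

A tweet that consists of nothing but a link collapses to a single space, which is why one row in the table above looks empty.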


### 0.1: Remove Mentions

I don't think that having these in here will contribute to data analysis. I've already got the search query tagged in a separate column. Besides, names don't seem to have that much sentiment attached to them, so it will be much easier to remove any mentions.

In [180]:
mentions_pattern = r'@\s[A-Za-z0-9_]+'

In [181]:
test.loc[:, 'tweet_text'].replace(mentions_pattern, " ", regex=True, inplace=True)

In [182]:
test

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,", you go girl! I got your back! Congrats on ...",naomi osaka
2,6,Sat Sep 08 19:59:56 +0000 2018,probably felt like she was at her friend’s h...,naomi osaka
3,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
4,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka
5,9,Sat Sep 08 19:59:55 +0000 2018,,naomi osaka
6,10,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,naomi osaka
7,11,Sat Sep 08 19:59:52 +0000 2018,Yes Bravo to Bajin And of course love as alw...,naomi osaka
8,12,Sat Sep 08 19:59:50 +0000 2018,Tennis officials.. where coaches are seen coac...,naomi osaka
9,13,Sat Sep 08 19:59:50 +0000 2018,Naomi Osaka tops Serena Williams in U.S. Open ...,naomi osaka


I figured out how to update the rows without getting the `SettingWithCopyWarning`: you have to select all rows with `:` and then the desired column. I'm going to update the above code to reflect this.

**ETA 2018.10.11**: I'm reading through *Python for Data Analysis* which helped me figure out how to update data in a series. In addition to the above, I simply had to flag the update as `inplace=True` and this seemed to work beautifully.

I couldn't remove all of the mentions: some names contained spaces, so parts of them were left behind (for example, "Bajin" from "@ BigSascha Bajin"). But I think this is no different than having someone's name in there in the first place; it still won't convey much information. All in all, I think getting as many as I could is better than not getting any at all, and the bits of mentions left over shouldn't contribute much to the final analysis.
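
To see why pieces of some mentions survive, the mentions pattern can be run on a single string. This is a small sketch with the standard library only; `strip_mentions` is a name I made up for illustration.

```python
import re

# "@", one whitespace character, then letters, digits, or underscores.
# The match stops at the first space, so a second word is not consumed.
MENTIONS_PATTERN = re.compile(r"@\s[A-Za-z0-9_]+")

def strip_mentions(text):
    """Replace each '@ handle' with a single space."""
    return MENTIONS_PATTERN.sub(" ", text)

two_word_name = strip_mentions("Yes Bravo to @ BigSascha Bajin")
leading_mention = strip_mentions("@ Naomi_Osaka_ , you go girl!")

print(repr(two_word_name))
print(repr(leading_mention))
```

The character class has no space in it, so the match ends after `BigSascha` and `Bajin` stays in the text.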

In [183]:
for tweet in test['tweet_text']:
    print(tweet)

Naomi Osaka upsets Serena Williams in controversial US Open final - CNN # SmartNews 
  , you go girl! I got your back! Congrats on the US open!
  probably felt like she was at her friend’s house when their mom started yelling at them # usopen
Congrats girly, don’t let anyone take this moment from you..you outplayed everyone, even the GOAT Serena  
Naomi Osaka defeats Serena Williams in a dramatic US Open final  
 
Carlos Ramos also robbed Osaka. Imagine how much better she would feel if she broke Serena to go up 5-3 instead of being giving the game.
Yes Bravo to   Bajin And of course love as always (since she turned pro!) to    
Naomi Osaka tops Serena Williams in U.S. Open final, becomes first Japanese grand slam singles champion | The Japan Times  


### 0.2: Remove Hashtags

I'm a bit nervous to do this because sometimes people use them as meaningful parts of the tweet. So I think I'll just remove the hash mark instead of the entire hashtag itself. For this data set, it shouldn't be hard to do because the hash mark appears separated by whitespace from the hashtag itself.

In [184]:
test.loc[:, 'tweet_text'].replace("#", "", regex=True, inplace=True)

In [185]:
for tweet in test['tweet_text']:
    print(tweet)

Naomi Osaka upsets Serena Williams in controversial US Open final - CNN  SmartNews 
  , you go girl! I got your back! Congrats on the US open!
  probably felt like she was at her friend’s house when their mom started yelling at them  usopen
Congrats girly, don’t let anyone take this moment from you..you outplayed everyone, even the GOAT Serena  
Naomi Osaka defeats Serena Williams in a dramatic US Open final  
 
Carlos Ramos also robbed Osaka. Imagine how much better she would feel if she broke Serena to go up 5-3 instead of being giving the game.
Yes Bravo to   Bajin And of course love as always (since she turned pro!) to    
Naomi Osaka tops Serena Williams in U.S. Open final, becomes first Japanese grand slam singles champion | The Japan Times  
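
For comparison, here is a rough sketch of the two options I was weighing: dropping only the hash mark versus dropping the whole hashtag. Both helper names are hypothetical, and the second pattern is my own guess at what "remove the entire hashtag" would look like for this data.

```python
import re

def strip_hash_marks(text):
    """Drop only the '#' character and keep the tag word."""
    return text.replace("#", "")

def strip_whole_hashtags(text):
    """Drop the '#' and the word after it (with or without a space between)."""
    return re.sub(r"#\s?\w+", "", text)

sample = "started yelling at them # usopen"

mark_removed = strip_hash_marks(sample)
tag_removed = strip_whole_hashtags(sample)

print(repr(mark_removed))
print(repr(tag_removed))
```

Keeping the tag word means `usopen` still reaches the later steps as a token, which is the behavior I wanted.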


## Step 1: Normalization

Now that the text appears relatively clean, I'll have to (as the tutorial says) *"[put] all words on equal footing, [allowing the] processing to proceed uniformly."* This means lowercasing the text, removing or converting numbers, getting rid of all special characters, and replacing contractions (otherwise, the tokenization step will split words like "we'll" into "we" and "'ll").

### 1.0: Contraction Expansion

What an oxymoronic phrase that is!

This step has to be done before removing special characters, or else we'll lose the contractions.

The `contractions.py` file used in this section was downloaded from [DJ Sarkar's repo](https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%20content/nlp%20proven%20approach/contractions.py), and the code was made available from his [Guide to NLP on Towards Data Science](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72).

In [186]:
from contractions import CONTRACTION_MAP

In [187]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    
    return expanded_text

My tweets are using curly quotes "’" instead of straight ones, so I'll have to go through and update all of these first. Then I'll expand the contractions.

In [188]:
test.loc[:, 'tweet_text'].replace("’", "'", regex=True, inplace=True)

In [189]:
test['tweet_text'] = test['tweet_text'].apply(expand_contractions)

In [190]:
for tweet in test['tweet_text']:
    print(tweet)

Naomi Osaka upsets Serena Williams in controversial US Open final - CNN  SmartNews 
  , you go girl! I got your back! Congrats on the US open!
  probably felt like she was at her friends house when their mom started yelling at them  usopen
Congrats girly, do not let anyone take this moment from you..you outplayed everyone, even the GOAT Serena  
Naomi Osaka defeats Serena Williams in a dramatic US Open final  
 
Carlos Ramos also robbed Osaka. Imagine how much better she would feel if she broke Serena to go up 5-3 instead of being giving the game.
Yes Bravo to   Bajin And of course love as always (since she turned pro!) to    
Naomi Osaka tops Serena Williams in U.S. Open final, becomes first Japanese grand slam singles champion | The Japan Times  
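
The idea behind the expansion step can be shown with a much smaller mapping. This is a simplified sketch, not the function from the repo: `SMALL_MAP` is a hypothetical three-entry stand-in for `CONTRACTION_MAP`, and `expand` is my own name.

```python
import re

# Hypothetical stand-in for CONTRACTION_MAP (the real file has many more entries).
SMALL_MAP = {"don't": "do not", "we'll": "we will", "can't": "cannot"}

# One alternation of all keys, matched case-insensitively.
PATTERN = re.compile("|".join(re.escape(key) for key in SMALL_MAP), flags=re.IGNORECASE)

def expand(text):
    """Normalize curly apostrophes, expand contractions, then drop leftover apostrophes."""
    text = text.replace("’", "'")

    def _swap(match):
        found = match.group(0)
        expanded = SMALL_MAP[found.lower()]
        # Keep the capitalization of the original first letter.
        return found[0] + expanded[1:]

    return PATTERN.sub(_swap, text).replace("'", "")

print(expand("Congrats girly, don’t let anyone take this moment"))
print(expand("her friend’s house"))
```

Possessives such as "friend's" are not in the map, so they only lose their apostrophe, which matches the "friends house" seen in the output above.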


### 1.1: Special Character Removal

Let's get rid of all the non-alphanumeric characters. I'm also going to remove all digits, as I'm not sure they will contribute to the sentiment of the tweet.

In [191]:
test.loc[:, 'tweet_text'].replace(r"[^a-zA-Z\s]", " ", regex=True, inplace=True)

In [192]:
for tweet in test['tweet_text']:
    print(tweet)

Naomi Osaka upsets Serena Williams in controversial US Open final   CNN  SmartNews 
    you go girl  I got your back  Congrats on the US open 
  probably felt like she was at her friends house when their mom started yelling at them  usopen
Congrats girly  do not let anyone take this moment from you  you outplayed everyone  even the GOAT Serena  
Naomi Osaka defeats Serena Williams in a dramatic US Open final  
 
Carlos Ramos also robbed Osaka  Imagine how much better she would feel if she broke Serena to go up     instead of being giving the game 
Yes Bravo to   Bajin And of course love as always  since she turned pro   to    
Naomi Osaka tops Serena Williams in U S  Open final  becomes first Japanese grand slam singles champion   The Japan Times  
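
The runs of spaces in the output come from replacing each removed character with its own space. Here is a small sketch of that behavior on one string; `keep_letters` mirrors the pattern used above, while `collapse_spaces` is an optional extra I'm assuming for illustration (the notebook itself leaves the extra spaces for the tokenizer to deal with).

```python
import re

def keep_letters(text):
    """Replace anything that is not an ASCII letter or whitespace with a space."""
    return re.sub(r"[^a-zA-Z\s]", " ", text)

def collapse_spaces(text):
    """Squeeze runs of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "go up 5-3 instead"

letters_only = keep_letters(sample)
tidy = collapse_spaces(letters_only)

print(repr(letters_only))
print(repr(tidy))
```

The three characters in `5-3` each become a space, so together with the two original spaces there are five in a row.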


### 1.2: Text Casing

Simply converting all text to lower case.

In [193]:
test['tweet_text'] = test['tweet_text'].str.lower()

In [194]:
for tweet in test['tweet_text']:
    print(tweet)

naomi osaka upsets serena williams in controversial us open final   cnn  smartnews 
    you go girl  i got your back  congrats on the us open 
  probably felt like she was at her friends house when their mom started yelling at them  usopen
congrats girly  do not let anyone take this moment from you  you outplayed everyone  even the goat serena  
naomi osaka defeats serena williams in a dramatic us open final  
 
carlos ramos also robbed osaka  imagine how much better she would feel if she broke serena to go up     instead of being giving the game 
yes bravo to   bajin and of course love as always  since she turned pro   to    
naomi osaka tops serena williams in u s  open final  becomes first japanese grand slam singles champion   the japan times  


## Step 2: Preprocessing

This is the final step before analysis, consisting of three parts: tokenization, stemming, and lemmatization.

First, let's remove all tweets that were left empty by the previous steps. These now contain only a single space, so I'll convert them to NaN and then drop them.

In [195]:
import numpy as np

In [196]:
test['tweet_text'].replace(' ', np.nan, inplace=True)

In [197]:
test.dropna(inplace=True)

In [198]:
# Displays a re-indexed copy; without assignment, `test` keeps its original index
test.reset_index(drop=True)

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,naomi osaka upsets serena williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,you go girl i got your back congrats on ...,naomi osaka
2,6,Sat Sep 08 19:59:56 +0000 2018,probably felt like she was at her friends ho...,naomi osaka
3,7,Sat Sep 08 19:59:55 +0000 2018,congrats girly do not let anyone take this mo...,naomi osaka
4,8,Sat Sep 08 19:59:55 +0000 2018,naomi osaka defeats serena williams in a drama...,naomi osaka
5,10,Sat Sep 08 19:59:54 +0000 2018,carlos ramos also robbed osaka imagine how mu...,naomi osaka
6,11,Sat Sep 08 19:59:52 +0000 2018,yes bravo to bajin and of course love as alw...,naomi osaka
7,12,Sat Sep 08 19:59:50 +0000 2018,tennis officials where coaches are seen coac...,naomi osaka
8,13,Sat Sep 08 19:59:50 +0000 2018,naomi osaka tops serena williams in u s open ...,naomi osaka
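
One thing to keep in mind when scaling this up: matching exactly `' '` only catches tweets that ended up as a single space. A rough standard-library sketch of a more forgiving check follows; the sample list and the helper name `is_blank` are made up for illustration.

```python
def is_blank(text):
    """True when the text is empty or contains only whitespace."""
    return text.strip() == ""

# Hypothetical cleaned tweets, including three differently "empty" ones.
tweets = ["naomi osaka upsets serena", " ", "", "   ", "carlos ramos"]

# What an exact comparison against a single space would catch.
exact_space = [t for t in tweets if t == " "]

# What a whitespace-insensitive check keeps.
kept = [t for t in tweets if not is_blank(t)]

print(len(exact_space))
print(kept)
```

On the ten test rows the exact match is enough, but a tweet made of, say, a mention followed by a link could end up as two or more spaces and slip through.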


### 2.0: Tokenization

Split each tweet into a list of word tokens.

In [199]:
import nltk

In [200]:
test['tweet_text'] = test['tweet_text'].apply(lambda x: nltk.word_tokenize(x))

In [201]:
for tweet in test['tweet_text']:
    print(tweet)

['naomi', 'osaka', 'upsets', 'serena', 'williams', 'in', 'controversial', 'us', 'open', 'final', 'cnn', 'smartnews']
['you', 'go', 'girl', 'i', 'got', 'your', 'back', 'congrats', 'on', 'the', 'us', 'open']
['probably', 'felt', 'like', 'she', 'was', 'at', 'her', 'friends', 'house', 'when', 'their', 'mom', 'started', 'yelling', 'at', 'them', 'usopen']
['congrats', 'girly', 'do', 'not', 'let', 'anyone', 'take', 'this', 'moment', 'from', 'you', 'you', 'outplayed', 'everyone', 'even', 'the', 'goat', 'serena']
['naomi', 'osaka', 'defeats', 'serena', 'williams', 'in', 'a', 'dramatic', 'us', 'open', 'final']
['carlos', 'ramos', 'also', 'robbed', 'osaka', 'imagine', 'how', 'much', 'better', 'she', 'would', 'feel', 'if', 'she', 'broke', 'serena', 'to', 'go', 'up', 'instead', 'of', 'being', 'giving', 'the', 'game']
['yes', 'bravo', 'to', 'bajin', 'and', 'of', 'course', 'love', 'as', 'always', 'since', 'she', 'turned', 'pro', 'to']
['naomi', 'osaka', 'tops', 'serena', 'williams', 'in', 'u', 's', '

I see a lot of stop words that I don't want in here, so let's get rid of those now.

In [202]:
from nltk.corpus import stopwords
stop = stopwords.words("english")

In [203]:
test['tweet_text'] = test['tweet_text'].apply(lambda x: [word for word in x if word not in (stop) and len(word) > 1])

In [204]:
for tweet in test['tweet_text']:
    print(tweet)

['naomi', 'osaka', 'upsets', 'serena', 'williams', 'controversial', 'us', 'open', 'final', 'cnn', 'smartnews']
['go', 'girl', 'got', 'back', 'congrats', 'us', 'open']
['probably', 'felt', 'like', 'friends', 'house', 'mom', 'started', 'yelling', 'usopen']
['congrats', 'girly', 'let', 'anyone', 'take', 'moment', 'outplayed', 'everyone', 'even', 'goat', 'serena']
['naomi', 'osaka', 'defeats', 'serena', 'williams', 'dramatic', 'us', 'open', 'final']
['carlos', 'ramos', 'also', 'robbed', 'osaka', 'imagine', 'much', 'better', 'would', 'feel', 'broke', 'serena', 'go', 'instead', 'giving', 'game']
['yes', 'bravo', 'bajin', 'course', 'love', 'always', 'since', 'turned', 'pro']
['naomi', 'osaka', 'tops', 'serena', 'williams', 'open', 'final', 'becomes', 'first', 'japanese', 'grand', 'slam', 'singles', 'champion', 'japan', 'times']
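
The filter itself is easy to try without NLTK. In this sketch `STOP` is a hypothetical handful of stop words rather than the full NLTK English list, and it is stored as a set so that each membership test is a hash lookup instead of a scan through a list.

```python
# Hypothetical subset of stop words, just enough for this one tweet.
STOP = {"you", "i", "your", "on", "the"}

tokens = ['you', 'go', 'girl', 'i', 'got', 'your', 'back',
          'congrats', 'on', 'the', 'us', 'open']

# Keep a token only if it is not a stop word and is longer than one character.
filtered = [word for word in tokens if word not in STOP and len(word) > 1]

print(filtered)
```

With the real list, wrapping it as `set(stopwords.words("english"))` should give the same result while being faster on the full set of tweets.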


### 2.1: Stemming/Lemmatization

Reducing inflectional forms of words. After some quick research, I've decided to try lemmatization in an attempt to better preserve linguistic meaning.

> Lemmatization is a more effective option than stemming because it converts the word into its root word, rather than just stripping the suffices. It makes use of the vocabulary and does a morphological analysis to obtain the root word. Therefore, we usually prefer using lemmatization over stemming. ([Source](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/))
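
To make the contrast in that quote concrete, here is a deliberately tiny toy example. Neither function is what NLTK or TextBlob does internally: `toy_stem` just chops a few suffixes, and `toy_lemmatize` looks words up in a hypothetical four-entry vocabulary, so the results will not match the real lemmatizer's output.

```python
def toy_stem(word):
    """Strip the first matching suffix, regardless of whether a real word remains."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical vocabulary mapping inflected forms to dictionary forms.
TOY_LEMMAS = {"becomes": "become", "giving": "give", "times": "time", "singles": "single"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

words = ["becomes", "giving", "times"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)
print(lemmas)
```

The stems are not always real words, while the lemmas are, but only for words the vocabulary knows about.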

In [205]:
from textblob import Word

In [206]:
test['tweet_text'] = test['tweet_text'].apply(lambda x: [Word(word).lemmatize() for word in x])

In [207]:
for tweet in test['tweet_text']:
    print(tweet)

['naomi', 'osaka', 'upset', 'serena', 'williams', 'controversial', 'u', 'open', 'final', 'cnn', 'smartnews']
['go', 'girl', 'got', 'back', 'congrats', 'u', 'open']
['probably', 'felt', 'like', 'friend', 'house', 'mom', 'started', 'yelling', 'usopen']
['congrats', 'girly', 'let', 'anyone', 'take', 'moment', 'outplayed', 'everyone', 'even', 'goat', 'serena']
['naomi', 'osaka', 'defeat', 'serena', 'williams', 'dramatic', 'u', 'open', 'final']
['carlos', 'ramos', 'also', 'robbed', 'osaka', 'imagine', 'much', 'better', 'would', 'feel', 'broke', 'serena', 'go', 'instead', 'giving', 'game']
['yes', 'bravo', 'bajin', 'course', 'love', 'always', 'since', 'turned', 'pro']
['naomi', 'osaka', 'top', 'serena', 'williams', 'open', 'final', 'becomes', 'first', 'japanese', 'grand', 'slam', 'single', 'champion', 'japan', 'time']


Again, removing words that have been reduced to single characters (here, the lemmatizer turned `us` into `u`):

In [208]:
test['tweet_text'] = test['tweet_text'].apply(lambda x: [word for word in x if len(word) > 1])

In [209]:
for tweet in test['tweet_text']:
    print(tweet)

['naomi', 'osaka', 'upset', 'serena', 'williams', 'controversial', 'open', 'final', 'cnn', 'smartnews']
['go', 'girl', 'got', 'back', 'congrats', 'open']
['probably', 'felt', 'like', 'friend', 'house', 'mom', 'started', 'yelling', 'usopen']
['congrats', 'girly', 'let', 'anyone', 'take', 'moment', 'outplayed', 'everyone', 'even', 'goat', 'serena']
['naomi', 'osaka', 'defeat', 'serena', 'williams', 'dramatic', 'open', 'final']
['carlos', 'ramos', 'also', 'robbed', 'osaka', 'imagine', 'much', 'better', 'would', 'feel', 'broke', 'serena', 'go', 'instead', 'giving', 'game']
['yes', 'bravo', 'bajin', 'course', 'love', 'always', 'since', 'turned', 'pro']
['naomi', 'osaka', 'top', 'serena', 'williams', 'open', 'final', 'becomes', 'first', 'japanese', 'grand', 'slam', 'single', 'champion', 'japan', 'time']


In [210]:
test

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,"[naomi, osaka, upset, serena, williams, contro...",naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"[go, girl, got, back, congrats, open]",naomi osaka
2,6,Sat Sep 08 19:59:56 +0000 2018,"[probably, felt, like, friend, house, mom, sta...",naomi osaka
3,7,Sat Sep 08 19:59:55 +0000 2018,"[congrats, girly, let, anyone, take, moment, o...",naomi osaka
4,8,Sat Sep 08 19:59:55 +0000 2018,"[naomi, osaka, defeat, serena, williams, drama...",naomi osaka
6,10,Sat Sep 08 19:59:54 +0000 2018,"[carlos, ramos, also, robbed, osaka, imagine, ...",naomi osaka
7,11,Sat Sep 08 19:59:52 +0000 2018,"[yes, bravo, bajin, course, love, always, sinc...",naomi osaka
8,12,Sat Sep 08 19:59:50 +0000 2018,"[tennis, official, coach, seen, coaching, play...",naomi osaka
9,13,Sat Sep 08 19:59:50 +0000 2018,"[naomi, osaka, top, serena, williams, open, fi...",naomi osaka


We've done it! We've completed all the steps to preprocess the data and get it ready for analysis. Next, I'll be writing a script to go through and scrub the database of tweets I've collected so far. We'll save this output as a scrubbed tweets pickle file.
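
As a rough idea of what that script might look like, here is a standard-library sketch that chains the regex-based steps into one function. It is not the final script: `scrub_tweet` is a hypothetical name, `str.split` stands in for `nltk.word_tokenize`, and contraction expansion and lemmatization are left out so the example runs without extra downloads.

```python
import re

URL_PATTERN = re.compile(r"http\S+\s?…?")
MENTIONS_PATTERN = re.compile(r"@\s[A-Za-z0-9_]+")
NON_LETTERS = re.compile(r"[^a-zA-Z\s]")

def scrub_tweet(text, stop_words=frozenset()):
    """Apply noise removal and normalization, then return a list of tokens."""
    text = URL_PATTERN.sub(" ", text)        # 0.0: hyperlinks
    text = MENTIONS_PATTERN.sub(" ", text)   # 0.1: mentions
    text = text.replace("#", "")             # 0.2: hash marks only
    text = text.replace("’", "'").replace("'", "")  # apostrophes (no expansion here)
    text = NON_LETTERS.sub(" ", text).lower()        # 1.1 and 1.2
    # Whitespace split as a stand-in for a real tokenizer.
    return [w for w in text.split() if w not in stop_words and len(w) > 1]

tokens = scrub_tweet(
    "@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!",
    stop_words={"you", "your", "on", "the"},
)
empty = scrub_tweet("https://twitter.com/juventino5555/status/1037768949109276672?s=19 …")

print(tokens)
print(empty)
```

A tweet that contained only a link comes back as an empty list, so the script can drop those rows by checking for empty token lists.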

In [212]:
# saving a test scrubbing file
test = df[:100].copy()
test.to_pickle("./data/test.pkl")