# Data Cleaning and Preprocessing

In the Data Wrangling notebook I noted that I would need to clean and prepare the data for analysis, but I didn't realize how many steps that would take. I ended up spending that entire notebook wrangling the data: getting it into the format I needed and pulling out the specific data I wanted while ignoring the irrelevant parts.

Now that the data is in the format I want, I can take the time to clean it up and get it ready for analysis.

In [21]:
import pandas as pd

In [22]:
df = pd.read_pickle("C:/Users/jzpow/Code/Projects/Naomi-Serena/data/naomi-serena-tweets.pkl")

In [23]:
df.head()

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
5,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
6,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
7,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23828 entries, 0 to 14998
Data columns (total 4 columns):
id              23828 non-null int64
tweet_date      23828 non-null object
tweet_text      23828 non-null object
search query    23828 non-null object
dtypes: int64(1), object(3)
memory usage: 930.8+ KB


In [25]:
# The index labels have gaps and stop at 14998 despite 23828 rows, so renumber them 0..n-1
df = df.reset_index(drop=True)

In [26]:
df['tweet_text'][:100]

0     Naomi Osaka upsets Serena Williams in controve...
1     @ Naomi_Osaka_ , you go girl! I got your back!...
2     @ Naomi_Osaka_ probably felt like she was at h...
3     Congrats girly, don’t let anyone take this mom...
4     Naomi Osaka defeats Serena Williams in a drama...
5     https://twitter.com/juventino5555/status/10377...
6     Carlos Ramos also robbed Osaka. Imagine how mu...
7     Yes Bravo to @ BigSascha Bajin And of course l...
8     Tennis officials.. where coaches are seen coac...
9     Naomi Osaka tops Serena Williams in U.S. Open ...
10    Booing damn Naomi Osaka won the girl was cooki...
11    You should take the with you @ Naomi_Osaka_ Co...
12    @ Naomi_Osaka_ Congratulaions. . . who were te...
13    The proof that something can be done is when y...
14    Her ambition, she once told a reporter, was “t...
15    [FULL] 2018 US Open trophy ceremony with Seren...
16    What's happening? Part 1. Naomi Osaka vs Seren...
17    @ Naomi_Osaka_ Congratulations You are pro

So from examining a few rows of the tweet data, I can see a few things that can be cleaned up:

* links
* hashtags (e.g. #3StripeLife)
* mentions (e.g. @ Naomi_Osaka_ or @ serenawilliams)
* misspellings (e.g. "awaful")
* headlines (e.g. [FULL] 2018 US Open Trophy...)
* slang (e.g. "She got screwed by the ump")

I'm not sure I can clean all of these up, but I can at least do a little bit. So let's get started!
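
Before removing anything, it can help to check how often each kind of noise shows up. The following is a minimal sketch, assuming a three-tweet sample copied from this dataset. The names `sample`, `noise_patterns` and `noise_counts` are made up for illustration, and the patterns are my own guesses rather than anything from the tutorial.

```python
import pandas as pd

# A few tweets from this dataset, used as a stand-in for df['tweet_text'].
sample = pd.Series([
    "@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!",
    "@ Naomi_Osaka_ probably felt like she was at her friend’s house when their mom started yelling at them # usopen",
    "Naomi Osaka defeats Serena Williams in a dramatic US Open final https://twitter.com/i/events/1038540032330493952 …",
])

# Illustrative patterns (assumptions). The scraped text puts a space after
# "@" and "#", so that space is treated as optional.
noise_patterns = {
    "links": r"http\S+",
    "hashtags": r"#\s?\w+",
    "mentions": r"@\s?\w+",
}

# Count how many tweets contain at least one match for each pattern.
noise_counts = {
    name: int(sample.str.contains(pattern, regex=True).sum())
    for name, pattern in noise_patterns.items()
}
print(noise_counts)
```

Misspellings, headlines and slang don't have a simple pattern like this, which is part of why I'm not sure they can all be cleaned up.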

*The preprocessing done in this notebook is based on the following tutorial: https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html*

In [29]:
test = df[:10].copy()
test

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
2,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
3,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
4,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka
5,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...,naomi osaka
6,10,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,naomi osaka
7,11,Sat Sep 08 19:59:52 +0000 2018,Yes Bravo to @ BigSascha Bajin And of course l...,naomi osaka
8,12,Sat Sep 08 19:59:50 +0000 2018,Tennis officials.. where coaches are seen coac...,naomi osaka
9,13,Sat Sep 08 19:59:50 +0000 2018,Naomi Osaka tops Serena Williams in U.S. Open ...,naomi osaka


## Step 0: Text Normalization

This is what the tutorial calls "noise removal." Noise removal is task-specific, so we need to take our own data into consideration when deciding what counts as noise.

For this tweet data, it looks like we'll want to remove hashtags, links and mentions.
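
As a rough sketch of what that removal could look like on a single string, assuming regex patterns of my own choosing: `strip_noise` and the three compiled patterns are hypothetical names, not something the tutorial provides.

```python
import re

# Illustrative patterns (assumptions, not taken from the tutorial). The scraped
# tweets put a space after "@" and "#", so that space is treated as optional.
URL_RE = re.compile(r"http\S+\s?…?")
MENTION_RE = re.compile(r"@\s?\w+")
HASHTAG_RE = re.compile(r"#\s?\w+")


def strip_noise(text: str) -> str:
    """Remove links, mentions and hashtags, then collapse leftover whitespace."""
    # Links go first: a hashtag can be fused to a URL ("# SmartNewshttps://...").
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(strip_noise(
    "@ Naomi_Osaka_ probably felt like she was at her friend’s house "
    "when their mom started yelling at them # usopen"
))
```

The order matters here. If hashtags were removed before links, the hashtag pattern would swallow the start of a fused URL and leave the rest behind.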

In [30]:
for tweet in test['tweet_text']:
    print(tweet)

Naomi Osaka upsets Serena Williams in controversial US Open final - CNN # SmartNewshttps://edition.cnn.com/2018/09/08/sport/naomi-osaka-serena-williams-us-open-tennis-int-spt/index.html …
@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!
@ Naomi_Osaka_ probably felt like she was at her friend’s house when their mom started yelling at them # usopen
Congrats girly, don’t let anyone take this moment from you..you outplayed everyone, even the GOAT Serena @ Naomi_Osaka_
Naomi Osaka defeats Serena Williams in a dramatic US Open final https://twitter.com/i/events/1038540032330493952 …
https://twitter.com/juventino5555/status/1037768949109276672?s=19 …
Carlos Ramos also robbed Osaka. Imagine how much better she would feel if she broke Serena to go up 5-3 instead of being giving the game.
Yes Bravo to @ BigSascha Bajin And of course love as always (since she turned pro!) to @ Naomi_Osaka_ https://twitter.com/bgtennisnation/status/1038563742961881090 …
Naomi Osaka tops Sere

### 0.0: Remove Hyperlinks

In [31]:
import re

In [32]:
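# Raw string so the regex escapes reach `re` unchanged.
# Matches a URL plus an optional trailing space and ellipsis.
url_pattern = r"http[^\s]+\s?…?"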

In [33]:
# for tweet in test.loc[test['tweet_text']]:
#     if re.search(url_pattern, tweet) is not None:
#         tweet = re.sub(url_pattern, ' ', tweet)
#     else:
#         pass

In [40]:
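# Caution: with no column label, this overwrites every column of the matching rows, not just tweet_text
test.loc[test['tweet_text'].str.contains("http")] = "TEST"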

In [41]:
# Returns a new Series; nothing is written back to `test` unless the result is assigned
test.loc[:, 'tweet_text'].str.replace(url_pattern, " ", regex=True)

0                                                 TEST
1    @ Naomi_Osaka_ , you go girl! I got your back!...
2    @ Naomi_Osaka_ probably felt like she was at h...
3    Congrats girly, don’t let anyone take this mom...
4                                                 TEST
5                                                 TEST
6    Carlos Ramos also robbed Osaka. Imagine how mu...
7                                                 TEST
8    Tennis officials.. where coaches are seen coac...
9                                                 TEST
Name: tweet_text, dtype: object

In [42]:
test

Unnamed: 0,id,tweet_date,tweet_text,search query
0,TEST,TEST,TEST,TEST
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
2,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
3,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
4,TEST,TEST,TEST,TEST
5,TEST,TEST,TEST,TEST
6,10,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,naomi osaka
7,TEST,TEST,TEST,TEST
8,12,Sat Sep 08 19:59:50 +0000 2018,Tennis officials.. where coaches are seen coac...,naomi osaka
9,TEST,TEST,TEST,TEST
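
In the table above, rows 0, 4, 5, 7 and 9 read `TEST` in every column, because the boolean mask was passed to `.loc` without a column label. Here is a minimal sketch of limiting the assignment to `tweet_text`, using a toy frame. The data and the names `toy`, `mask` and `flagged` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `test`; the values are invented for this example.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "tweet_text": [
        "see https://example.com/a …",
        "no link here",
        "http://example.com/b",
    ],
})

# Boolean mask: True for rows whose text contains a link.
mask = toy["tweet_text"].str.contains("http")

# Passing the column label as the second argument to .loc limits the
# assignment to that column, so `id` keeps its original values.
flagged = toy.copy()
flagged.loc[mask, "tweet_text"] = "TEST"
print(flagged)
```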


In [18]:
test.iloc[5]

id                                           9
tweet_date      Sat Sep 08 19:59:55 +0000 2018
tweet_text                                    
search query                       naomi osaka
Name: 5, dtype: object

In [75]:
for tweet in test['tweet_text']:
    print(tweet)

Naomi Osaka upsets Serena Williams in controversial US Open final - CNN # SmartNews …
@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!
@ Naomi_Osaka_ probably felt like she was at her friend’s house when their mom started yelling at them # usopen
Congrats girly, don’t let anyone take this moment from you..you outplayed everyone, even the GOAT Serena @ Naomi_Osaka_
Naomi Osaka defeats Serena Williams in a dramatic US Open final  …
 …
Carlos Ramos also robbed Osaka. Imagine how much better she would feel if she broke Serena to go up 5-3 instead of being giving the game.
Yes Bravo to @ BigSascha Bajin And of course love as always (since she turned pro!) to @ Naomi_Osaka_  …
Naomi Osaka tops Serena Williams in U.S. Open final, becomes first Japanese grand slam singles champion | The Japan Times  …
nan


### 0.1: Remove Mentions

In [77]:
# Raw string so `\s` is passed to the regex engine rather than treated as a string escape
mentions_pattern = r'@\s[A-Za-z0-9_]+'

In [81]:
test.loc[:, 'tweet_text'].str.replace(mentions_pattern, " ", regex=True)

0             Naomi Osaka upsets Serena Williams in controve...
1               , you go girl! I got your back! Congrats on ...
2               probably felt like she was at her friend’s h...
3             Congrats girly, don’t let anyone take this mom...
4             Naomi Osaka defeats Serena Williams in a drama...
5                                                             …
6             Carlos Ramos also robbed Osaka. Imagine how mu...
7             Yes Bravo to   Bajin And of course love as alw...
8             Tennis officials.. where coaches are seen coac...
9             Naomi Osaka tops Serena Williams in U.S. Open ...
tweet_text                                                  NaN
Name: tweet_text, dtype: object

In [82]:
test

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1.0,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2.0,Sat Sep 08 19:59:57 +0000 2018,", you go girl! I got your back! Congrats on ...",naomi osaka
2,6.0,Sat Sep 08 19:59:56 +0000 2018,probably felt like she was at her friend’s h...,naomi osaka
3,7.0,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
4,8.0,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka
5,9.0,Sat Sep 08 19:59:55 +0000 2018,…,naomi osaka
6,10.0,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,naomi osaka
7,11.0,Sat Sep 08 19:59:52 +0000 2018,Yes Bravo to Bajin And of course love as alw...,naomi osaka
8,12.0,Sat Sep 08 19:59:50 +0000 2018,Tennis officials.. where coaches are seen coac...,naomi osaka
9,13.0,Sat Sep 08 19:59:50 +0000 2018,Naomi Osaka tops Serena Williams in U.S. Open ...,naomi osaka


I figured out how to update the rows without triggering the `SettingWithCopyWarning`: inside `.loc`, select all rows with `:` and then name the desired column. I'm going to update the code above to reflect this.
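
A rough sketch of that pattern, assuming a small stand-in frame: `toy` is hypothetical, and the two regex patterns are the ones defined earlier, written as raw strings.

```python
import pandas as pd

url_pattern = r"http[^\s]+\s?…?"
mentions_pattern = r"@\s[A-Za-z0-9_]+"

# Stand-in for `test`, built from two tweets in this dataset.
toy = pd.DataFrame({
    "id": [8, 2],
    "tweet_text": [
        "Naomi Osaka defeats Serena Williams in a dramatic US Open final "
        "https://twitter.com/i/events/1038540032330493952 …",
        "@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!",
    ],
})

# Select all rows (:) and one column, then assign the cleaned Series back.
toy.loc[:, "tweet_text"] = toy["tweet_text"].str.replace(url_pattern, " ", regex=True)
toy.loc[:, "tweet_text"] = toy["tweet_text"].str.replace(mentions_pattern, " ", regex=True)

for tweet in toy["tweet_text"]:
    print(tweet)
```

Each replacement leaves a space behind, so a final `.str.strip()` (or a whitespace-collapsing step) would still be needed to tidy the result.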