# Data Wrangling

Now that I've got the data I need, I'll have to clean it up.

It looks like there will be two parts to the data cleaning process:

1. Text Cleansing
2. Text Preparation

## Text Cleansing
This will be general cleaning of the text to:
* remove special characters, hyperlinks, and mentions of other users
* convert all text to lowercase
* remove stop words
* replace abbreviations
* correct spelling

Here are some resources I can use to help with this part of the process:
* [Data Cleaning](https://towardsdatascience.com/another-twitter-sentiment-analysis-with-python-part-2-333514854913)
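
The cleansing steps listed above could look roughly like the sketch below. It uses only the standard library. The `STOP_WORDS` set and the `clean_tweet` helper are hypothetical names invented for illustration (a real run would probably use a fuller stop-word list), and abbreviation replacement and spelling correction are left out.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "to", "and", "i", "you", "your"}

# URLs in the scraped tweets are sometimes glued to the previous word,
# so no word boundary is required in front of "http".
URL_RE = re.compile(r"(?:https?://|www\.|pic\.twitter\.com/)\S+")
# The scraped tweets have a space after "@", e.g. "@ Naomi_Osaka_".
MENTION_RE = re.compile(r"@\s?\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase, strip URLs, mentions, and special characters, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


print(clean_tweet("@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!"))
```

One side effect of stripping every non-letter is that contractions such as "don't" are split in two, so the abbreviation step may need to run before this one.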

## Text Preparation
This will prepare the text for analysis by:
* stemming
* lemmatization
* POS tagging

Here are some resources I can use to help with this part of the process:

* [Basic Data Engineering](https://medium.com/@SeoJaeDuk/basic-data-cleaning-engineering-session-twitter-sentiment-data-b9376a91109b)

Other useful resources:

* https://codereview.stackexchange.com/questions/163446/cleaning-and-extracting-meaningful-text-from-tweets
* https://www.kdnuggets.com/2017/11/framework-approaching-textual-data-tasks.html
* https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html
* https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html

In [1]:
import json
import sys
sys.path.append("data/")

In [2]:
file = "C:/Users/jzpow/Code/Projects/Naomi-Serena/data/Naomi_Eng.csv"

In [3]:
import pandas as pd

In [4]:
pd.read_csv(file)

Unnamed: 0,id,tweet_id,tweet_date,tweet_text,tweet_loc
0,1,1038577721113161728,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,
1,2,1038577715278954496,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",
2,3,1038577714662461445,Sat Sep 08 19:59:57 +0000 2018,,
3,4,1038577711881613317,Sat Sep 08 19:59:57 +0000 2018,Ofusca a imagem da Serena apenas. Ela se perde...,
4,5,1038577711566872577,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！,
5,6,1038577708962332680,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,
6,7,1038577706953318405,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",
7,8,1038577706257055744,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,
8,9,1038577705774669825,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...,
9,10,1038577701676875776,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,


In [5]:
df = pd.read_csv(file)  # read_csv already returns a DataFrame

In [6]:
df.head()

Unnamed: 0,id,tweet_id,tweet_date,tweet_text,tweet_loc
0,1,1038577721113161728,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,
1,2,1038577715278954496,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",
2,3,1038577714662461445,Sat Sep 08 19:59:57 +0000 2018,,
3,4,1038577711881613317,Sat Sep 08 19:59:57 +0000 2018,Ofusca a imagem da Serena apenas. Ela se perde...,
4,5,1038577711566872577,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！,


In [7]:
df.tail()

Unnamed: 0,id,tweet_id,tweet_date,tweet_text,tweet_loc
14995,14996,1038553806118707200,Sat Sep 08 18:24:57 +0000 2018,Class acts don't smash rackets then verbally a...,
14996,14997,1038553806085050368,Sat Sep 08 18:24:57 +0000 2018,なおみちゃん、全米オープン優勝おめでとう嬉しくて…涙が出ちゃったよ。,
14997,14998,1038553805959389184,Sat Sep 08 18:24:57 +0000 2018,¡Nace una Estrella! Naomi Osaka venció a Seren...,
14998,14999,1038553804801761281,Sat Sep 08 18:24:57 +0000 2018,"It was obvious that Williams loss, at least th...",
14999,15000,1038553804298436615,Sat Sep 08 18:24:56 +0000 2018,Once again the racist media is trying to deny ...,


In [8]:
df.index

RangeIndex(start=0, stop=15000, step=1)

In [9]:
df.columns

Index(['id', 'tweet_id', 'tweet_date', 'tweet_text', 'tweet_loc'], dtype='object')

In [10]:
df.values

array([[1, 1038577721113161728, 'Sat Sep 08 19:59:59 +0000 2018',
        'Naomi Osaka upsets Serena Williams in controversial US Open final - CNN # SmartNewshttps://edition.cnn.com/2018/09/08/sport/naomi-osaka-serena-williams-us-open-tennis-int-spt/index.html …',
        nan],
       [2, 1038577715278954496, 'Sat Sep 08 19:59:57 +0000 2018',
        '@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!',
        nan],
       [3, 1038577714662461445, 'Sat Sep 08 19:59:57 +0000 2018', nan,
        nan],
       ...,
       [14998, 1038553805959389184, 'Sat Sep 08 18:24:57 +0000 2018',
        '¡Nace una Estrella! Naomi Osaka venció a Serena Williams por 6-2 y 6-4 en la final del # USOpen y se convirtió en la primera japonesa en ganar un Gran Slam y en la campeona más joven del torneo desde Maria Sharapova (19 años) que se quedó con el trofeo en el 2006.pic.twitter.com/O1pnar1BC6',
        nan],
       [14999, 1038553804801761281, 'Sat Sep 08 18:24:57 +0000 2018',
     

In [11]:
df.describe()

Unnamed: 0,id,tweet_id,tweet_loc
count,15000.0,15000.0,0.0
mean,7500.5,1.038563e+18,
std,4330.271354,6931693000000.0,
min,1.0,1.038554e+18,
25%,3750.75,1.038557e+18,
50%,7500.5,1.038562e+18,
75%,11250.25,1.038569e+18,
max,15000.0,1.038578e+18,


## Step #0: Remove Unnecessary Columns

Ok, so just by doing this I can see that there was not a single value stored in the location column (the `count` for `tweet_loc` is 0). So as much as I would've liked to do a geographical analysis, I'm not sure it will be possible. Also, I believe each tweet gets a new tweet ID, even if it's a retweet, so I don't think I'll need that column after all. I should drop both columns from the data.

In [12]:
df = df.drop(columns=["tweet_loc", "tweet_id"])
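
A quick way to confirm that a column is completely empty before dropping it is to ask pandas directly instead of reading it off `.describe()`. The snippet below is a small sketch on a made-up dataframe (`toy` and `empty_cols` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Made-up data: one useful column and one column with no values at all.
toy = pd.DataFrame({
    "tweet_text": ["hello", "bye"],
    "tweet_loc": [np.nan, np.nan],
})

# Names of the columns in which every single value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

toy = toy.drop(columns=empty_cols)
```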

## Step #1: Drop Duplicates and NaNs

Since `tweet_id` can't tell me whether two tweets are the same, I'll drop duplicates on `tweet_text` instead (that is, if the exact same tweet appears more than once, I only want to keep one copy).

In [13]:
df = df.drop_duplicates("tweet_text")

In [14]:
df.shape

(14398, 3)

In [15]:
df = df.dropna()

In [16]:
df.shape

(14397, 3)

Looks like only one row with no text was left after deduplication, and `dropna()` has successfully removed it from the data.
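
One caveat worth checking here: `drop_duplicates` treats missing values as equal to each other, so any number of text-less tweets would already have been collapsed into a single row before `dropna()` ran. The toy dataframe below (made-up data, not from the tweet file) is a small sketch of that behaviour, and of counting the NaNs before deduplicating instead.

```python
import numpy as np
import pandas as pd

# Made-up rows: two empty tweets and one repeated tweet.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "tweet_text": ["hello", np.nan, "hello", np.nan, "bye"],
})

# Count missing texts *before* deduplicating to get the true number.
n_missing = int(toy["tweet_text"].isna().sum())

deduped = toy.drop_duplicates("tweet_text")      # keeps only the first NaN row
cleaned = deduped.dropna(subset=["tweet_text"])  # removes that remaining row

print(n_missing, len(deduped), len(cleaned))
```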

## Step #2: Remove Non-English Tweets

This is going to be a bit tricky, but I want to try it as best I can. According to the Internet, there are Python modules that check whether a piece of text is in English or not. I'll see if I can use one of them to drop all non-English tweets from the dataframe.

### ETA: 2018.10.07

I found a Python package called `TextBlob` that does sentiment analysis, among other things, and it also includes language detection. While I figured out a rudimentary solution on my own (after failing to get `langdetect` to work), I want to learn how to use this package because I think it will be super useful to me in the future. So I've turned the old code into markdown cells below.

from langdetect import detect

eng_str = "This is an English string."
jp_str = "日本語のキーボード"

detect(eng_str)

detect(jp_str)

def is_eng(s):
    lang = detect(s)
    return lang == 'en'

df["tweet_text"].values

def my_detect(n):
  return lambda n : detect(n)

lang_detector = my_detect(is_eng)

lang_detector(eng_str)

lang_detector(jp_str)

type(df["tweet_text"])

# figured out how to lowercase all strings
df['tweet_text'] = df['tweet_text'].str.lower()

df.head()

df['tweet_text'].apply(lang_detector)  # Series.str has no lang_detector accessor; apply the function to each tweet instead

qz = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'], 
                   'B': ['a', 'a', 'b', 'c', 'c'], 
                   'C': ['a', 'a', 'b', 'b', 'c']})



qz

for col in qz:
    vc = qz[col].value_counts()
    vals_to_remove = vc[vc <= 1].index.values
    qz.loc[qz[col].isin(vals_to_remove), col] = None  # single .loc call avoids chained assignment


qz

type(df["tweet_text"].values)

test = df[:10]
test

for col in test:
    vc = test[col].value_counts()
    vals_to_ignore = 'en'
    # vals_to_remove = vc[vc <= 1].index.values
    # qz[col].loc[qz[col].isin(vals_to_remove)] = None
    print(vc)

test["tweet_text"]

test.loc[test['tweet_text'].apply(lambda x: len(x) <= 100)]

japanese_letters = "[ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ゗゘゙゚゛゜ゝゞゟァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ・ーヽヾヿ一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乎乏乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿]+"

test

import re
japanese="こんにちは"
re.search(japanese_letters, japanese)

for p in test["tweet_text"]:
    print(re.search(japanese_letters, p))

test.loc[test['tweet_text'].apply(lambda x: re.search(japanese_letters, x) is None)]

Yes!!! I got it to work!

Okay, so instead of trying to use a fancy Python module that wasn't doing what I needed it to, I decided to go back to basics: put a bunch of Japanese characters in a regex character class and use it to cull the tweets that contain any of them. Now, that's not going to get rid of all the non-English tweets of course, but I think I can use the same trick to rule out some other languages (like Portuguese) that use diacritics English doesn't.
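
A more compact variant of the same idea is to match Unicode ranges rather than listing characters one by one. The sketch below assumes the hiragana, katakana, and CJK Unified Ideographs blocks are enough for these tweets; the `has_japanese` helper is a hypothetical name. Note that the kanji range also matches Chinese text, so this is a script check rather than a true language check.

```python
import re

# Hiragana (U+3040-U+309F), katakana (U+30A0-U+30FF),
# and CJK Unified Ideographs (U+4E00-U+9FFF).
JAPANESE_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def has_japanese(text):
    """Return True if the text contains at least one kana or kanji character."""
    return JAPANESE_RE.search(text) is not None


tweets = [
    "Naomi Osaka upsets Serena Williams",
    "優勝おめでとうございます",
    "CONGRATS @ Naomi_Osaka_!!!! おめでとう!!",
]

# Keep only the tweets without any Japanese characters.
print([t for t in tweets if not has_japanese(t)])
```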

## Step #2.1: Remove Non-English Tweets
### October 7, 2018

So I want to use the TextBlob package to try and do this.

In [17]:
from textblob import TextBlob, Word, Blobber
from textblob.classifiers import NaiveBayesClassifier
from textblob.taggers import NLTKTagger

tb_test = df[:10]
tb_test

tb_test.loc[tb_test['tweet_text'].apply(lambda x: TextBlob(x).detect_language() == "en")]

Yes!!! That did it! This package is awesome! Just for fun, let's see what other languages we can find:

### ETA: Aaaand this is why shit doesn't get done...

Apparently Google blocks your IP address after a few test calls to their API. Either that, or the API has been shut off from public use and would require registration or even payment. So it looks like I'm back to brute forcin' it.

In [18]:
import langid

In [19]:
tiat = "This is a text."
langid.classify(tiat)[0]

'en'

In [20]:
new_test = df[:10]
new_test

Unnamed: 0,id,tweet_date,tweet_text
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!..."
3,4,Sat Sep 08 19:59:57 +0000 2018,Ofusca a imagem da Serena apenas. Ela se perde...
4,5,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！
5,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...
6,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom..."
7,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...
8,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...
9,10,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...
10,11,Sat Sep 08 19:59:52 +0000 2018,Yes Bravo to @ BigSascha Bajin And of course l...


In [21]:
new_test[new_test['tweet_text'].apply(lambda x: langid.classify(x)[0] == 'en')]

Unnamed: 0,id,tweet_date,tweet_text
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!..."
5,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...
6,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom..."
7,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...
8,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...
9,10,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...
10,11,Sat Sep 08 19:59:52 +0000 2018,Yes Bravo to @ BigSascha Bajin And of course l...


I'm a little wary of getting my hopes up... but it finally looks like I found a workable solution! Unlike `langdetect`, `langid` works on small strings of text, and unlike `TextBlob`'s Google API, it won't rate-limit me. Let's try this again:

In [22]:
tttc = df[:100] #tttc: third time's the charm!

In [23]:
tttc[tttc['tweet_text'].apply(lambda x: langid.classify(x)[0] == 'ja')]

Unnamed: 0,id,tweet_date,tweet_text
4,5,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！
32,33,Sat Sep 08 19:59:41 +0000 2018,優勝おめでとうございます 日本中の人々が勇気づけされました(^^)ありがとうございます
46,47,Sat Sep 08 19:59:28 +0000 2018,来シーズンに期待
75,76,Sat Sep 08 19:59:14 +0000 2018,優勝おめでとうございます
87,88,Sat Sep 08 19:59:08 +0000 2018,@ Naomi_Osaka_ あなたの美しい
88,89,Sat Sep 08 19:59:06 +0000 2018,CONGRATS @ Naomi_Osaka_!!!! おめでとう!!https://twi...


Yes!! Yes, Yes, YES!! I think I got this! Ok, so now let's reduce the original dataframe to just its English tweets:

In [24]:
df = df[df['tweet_text'].apply(lambda x: langid.classify(x)[0] == 'en')]

Wow! This got us down to 11,406 tweets out of the 15,000 we started with (14,397 after dropping duplicates and NaNs)! That means around 3,000 of the remaining tweets aren't in English. I wonder if they're all Japanese?
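
One way to answer that question would be to label every tweet with its detected language once and then tally the labels. The sketch below keeps the detector pluggable so it runs without `langid` installed: `count_languages` and `toy_classify` are hypothetical helpers, and in this notebook the `classify` argument would be something like `lambda t: langid.classify(t)[0]`.

```python
from collections import Counter


def count_languages(texts, classify):
    """Tally the language code that `classify` returns for each text."""
    return Counter(classify(text) for text in texts)


def toy_classify(text):
    """Stand-in detector for illustration only: any kana means 'ja', else 'en'."""
    return "ja" if any("\u3040" <= ch <= "\u30ff" for ch in text) else "en"


sample = ["you go girl!", "おめでとう", "来シーズンに期待", "come back stronger"]
counts = count_languages(sample, toy_classify)
print(counts.most_common())
```

Running the real classifier once and storing its label in a column would also avoid classifying the same tweets again for every language of interest.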

Okay, so far (as of 12AM on Oct. 8, 2018) I've completed the following steps:
1. Read in the data
  * convert to a Pandas dataframe
2. Exploratory data analysis
  * look at `.head()` and `.tail()`
  * look at `.index`, `.columns`, `.values` and `.describe()`
3. Remove unnecessary data
  * drop `'tweet_loc'` and `'tweet_id'`
  * drop duplicate tweets and NaNs in `'tweet_text'`
  * use `langid` to remove non-English tweets
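
The steps listed above could be wrapped in one function so they can be reused on another file. This is only a sketch: `load_english_tweets` and its `is_english` parameter are hypothetical names, and the language check is passed in as a plain function (in this notebook it would wrap `langid.classify`) so that the example runs on a small in-memory CSV. It also restricts `dropna` to `tweet_text`, which is a slight change from the cells above.

```python
import io

import pandas as pd


def load_english_tweets(source, is_english):
    """Read a tweet CSV, then drop unused columns, duplicates, NaNs, and non-English rows."""
    df = pd.read_csv(source)
    df = df.drop(columns=["tweet_loc", "tweet_id"])
    df = df.drop_duplicates("tweet_text").dropna(subset=["tweet_text"])
    return df[df["tweet_text"].apply(is_english)]


# Made-up miniature CSV with the same columns as the real files.
csv_text = """id,tweet_id,tweet_date,tweet_text,tweet_loc
1,101,Sat Sep 08 19:59:59 +0000 2018,you go girl,
2,102,Sat Sep 08 19:59:58 +0000 2018,you go girl,
3,103,Sat Sep 08 19:59:57 +0000 2018,,
4,104,Sat Sep 08 19:59:56 +0000 2018,おめでとう,
"""

# Toy stand-in for the language check: ASCII-only text counts as English.
english = load_english_tweets(io.StringIO(csv_text), lambda t: t.isascii())
print(english)
```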
  
Now I can actually start to clean up the tweets. First I'll pull in Serena's tweets as well and add a new column, `'search query'`, to tell the two sets apart, so that I can clean all the data at once. Besides, some tweets in the Serena_Eng table might overlap with ones in the Naomi_Eng table, and I'd like to deal with that now.

Once I've got both tables pulled in and ready to prepare, I'll move into the **Text Cleansing** and **Text Preparation** steps I outlined above.
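
For the overlap mentioned above, one possible approach (a sketch on made-up rows, not what the cells below do) is to tag each table with its query, concatenate with `ignore_index=True`, and then drop repeated `tweet_text` values. Which copy of an overlapping tweet to keep is a judgment call; `keep="first"` here simply favours the Naomi table.

```python
import pandas as pd

# Made-up stand-ins for the two cleaned tables.
naomi = pd.DataFrame({"tweet_text": ["osaka wins", "what a final"]})
serena = pd.DataFrame({"tweet_text": ["what a final", "come back stronger"]})

naomi["search query"] = "naomi osaka"
serena["search query"] = "serena williams"

# ignore_index=True renumbers the rows 0..n-1 in the combined dataframe.
combined = pd.concat([naomi, serena], ignore_index=True)

# Tweets matching both searches appear twice; keep only the first copy.
combined = combined.drop_duplicates("tweet_text", keep="first").reset_index(drop=True)
print(combined)
```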

In [25]:
# Get the necessary tweets from Serena's table
serena_file = "C:/Users/jzpow/Code/Projects/Naomi-Serena/data/Serena_Eng.csv"
serena_df = pd.read_csv(serena_file)  # read_csv already returns a DataFrame

In [26]:
serena_df.head()

Unnamed: 0,id,tweet_id,tweet_date,tweet_text,tweet_loc
0,1,1038577724082868224,Sat Sep 08 19:59:59 +0000 2018,@ serenawilliams come back stronger,
1,2,1038577723877154816,Sat Sep 08 19:59:59 +0000 2018,@ serenawilliams WELL for years I’ve tried to ...,
2,3,1038577722522574849,Sat Sep 08 19:59:59 +0000 2018,"Após derrota na decisão do US Open, Serena Wil...",
3,4,1038577722434486277,Sat Sep 08 19:59:59 +0000 2018,To be fair it was a meltdown whether justified...,
4,5,1038577721339564032,Sat Sep 08 19:59:59 +0000 2018,The most important is the triumph of # Osaka N...,


In [27]:
serena_df.tail()

Unnamed: 0,id,tweet_id,tweet_date,tweet_text,tweet_loc
14995,14996,1038558953540644869,Sat Sep 08 18:45:24 +0000 2018,"Noah, por favor, você sabe a que horas é a fin...",
14996,14997,1038558952710135809,Sat Sep 08 18:45:24 +0000 2018,@ serenawilliams Mama needs a dictionary. Grac...,
14997,14998,1038558948645847041,Sat Sep 08 18:45:23 +0000 2018,I’d rather lose than cheat @ serenawilliams,
14998,14999,1038558948234735616,Sat Sep 08 18:45:23 +0000 2018,"Sorry, don't agree on this one. Referee had no...",
14999,15000,1038558948033523712,Sat Sep 08 18:45:23 +0000 2018,Ténis - Árbitro português tinha razão: treinad...,


In [28]:
serena_df.index

RangeIndex(start=0, stop=15000, step=1)

In [29]:
serena_df.columns

Index(['id', 'tweet_id', 'tweet_date', 'tweet_text', 'tweet_loc'], dtype='object')

In [30]:
serena_df.values

array([[1, 1038577724082868224, 'Sat Sep 08 19:59:59 +0000 2018',
        '@ serenawilliams come back stronger', nan],
       [2, 1038577723877154816, 'Sat Sep 08 19:59:59 +0000 2018',
        '@ serenawilliams WELL for years I’ve tried to give @ serenawilliams benefit of the doubt Turns out I’ve been right all along @ Serena you’re a # BULLY This will be a match that you’ll be proud to share with your daughter # TeamOsaka',
        nan],
       [3, 1038577722522574849, 'Sat Sep 08 19:59:59 +0000 2018',
        'Após derrota na decisão do US Open, Serena Williams acusa árbitro de sexismo! http://oespresso.com.br/posted-links-right/apos-derrota-na-decisao-do-us-open-serena-williams-acusa-arbitro-de-sexismo …',
        nan],
       ...,
       [14998, 1038558948645847041, 'Sat Sep 08 18:45:23 +0000 2018',
        'I’d rather lose than cheat @ serenawilliams', nan],
       [14999, 1038558948234735616, 'Sat Sep 08 18:45:23 +0000 2018',
        "Sorry, don't agree on this one. Referee had n

In [31]:
serena_df.describe()

Unnamed: 0,id,tweet_id,tweet_loc
count,15000.0,15000.0,0.0
mean,7500.5,1.038567e+18,
std,4330.271354,5431422000000.0,
min,1.0,1.038559e+18,
25%,3750.75,1.038562e+18,
50%,7500.5,1.038567e+18,
75%,11250.25,1.038572e+18,
max,15000.0,1.038578e+18,


In [32]:
serena_df = serena_df.drop(columns=["tweet_loc", "tweet_id"])
serena_df = serena_df.drop_duplicates("tweet_text")
serena_df = serena_df.dropna()

In [33]:
serena_df.shape

(14483, 3)

In [34]:
df.shape

(11406, 3)

In [35]:
serena_df = serena_df[serena_df['tweet_text'].apply(lambda x: langid.classify(x)[0] == 'en')]

In [36]:
serena_df.shape

(12422, 3)

Okay! We have about 1,000 more English-language tweets when searching "Serena Williams" (12,422 versus 11,406). The language filter removed roughly 3,000 non-English tweets from the Naomi table and roughly 2,000 from the Serena table. Now we'll join them together to create our English tweet dataframe:
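
The row counts reported by `.shape` above make the arithmetic easy to check. A quick sketch, with the numbers copied by hand from the cell outputs:

```python
# (rows after dropping duplicates and NaNs, rows after the langid filter)
counts = {
    "naomi osaka": (14397, 11406),
    "serena williams": (14483, 12422),
}

# Rows removed by the language filter for each search query.
removed = {query: before - after for query, (before, after) in counts.items()}

for query, n in removed.items():
    share = n / counts[query][0]
    print(f"{query}: {n} non-English tweets removed ({share:.1%})")

total_english = sum(after for _, after in counts.values())
print("combined English tweets:", total_english)
```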

In [37]:
df['search query'] = 'naomi osaka'

In [38]:
df.head()

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
5,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
6,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
7,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka


In [39]:
serena_df['search query'] = 'serena williams'

In [40]:
serena_df.tail()

Unnamed: 0,id,tweet_date,tweet_text,search query
14993,14994,Sat Sep 08 18:45:25 +0000 2018,@ serenawilliams you are an inspiration and a ...,serena williams
14994,14995,Sat Sep 08 18:45:24 +0000 2018,@ serenawilliams I love you so much mommy. I k...,serena williams
14996,14997,Sat Sep 08 18:45:24 +0000 2018,@ serenawilliams Mama needs a dictionary. Grac...,serena williams
14997,14998,Sat Sep 08 18:45:23 +0000 2018,I’d rather lose than cheat @ serenawilliams,serena williams
14998,14999,Sat Sep 08 18:45:23 +0000 2018,"Sorry, don't agree on this one. Referee had no...",serena williams


In [41]:
naomi_serena_tweets = pd.concat([df, serena_df])

In [42]:
naomi_serena_tweets.head()

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
5,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
6,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
7,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka


In [43]:
naomi_serena_tweets.tail()

Unnamed: 0,id,tweet_date,tweet_text,search query
14993,14994,Sat Sep 08 18:45:25 +0000 2018,@ serenawilliams you are an inspiration and a ...,serena williams
14994,14995,Sat Sep 08 18:45:24 +0000 2018,@ serenawilliams I love you so much mommy. I k...,serena williams
14996,14997,Sat Sep 08 18:45:24 +0000 2018,@ serenawilliams Mama needs a dictionary. Grac...,serena williams
14997,14998,Sat Sep 08 18:45:23 +0000 2018,I’d rather lose than cheat @ serenawilliams,serena williams
14998,14999,Sat Sep 08 18:45:23 +0000 2018,"Sorry, don't agree on this one. Referee had no...",serena williams


In [44]:
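naomi_serena_tweets = naomi_serena_tweets.reset_index(drop=True)  # reset_index returns a new dataframe, so assign it back
naomi_serena_tweets.head(10)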

Unnamed: 0,id,tweet_date,tweet_text,search query
0,1,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,naomi osaka
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",naomi osaka
2,6,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,naomi osaka
3,7,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",naomi osaka
4,8,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,naomi osaka
5,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...,naomi osaka
6,10,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,naomi osaka
7,11,Sat Sep 08 19:59:52 +0000 2018,Yes Bravo to @ BigSascha Bajin And of course l...,naomi osaka
8,12,Sat Sep 08 19:59:50 +0000 2018,Tennis officials.. where coaches are seen coac...,naomi osaka
9,13,Sat Sep 08 19:59:50 +0000 2018,Naomi Osaka tops Serena Williams in U.S. Open ...,naomi osaka


I'm saving this dataframe as a `.pkl` file so that I can continue to process and cleanse the data in another notebook. This one is already taking a while to run.

In [45]:
naomi_serena_tweets.to_pickle('C:/Users/jzpow/Code/Projects/Naomi-Serena/data/naomi-serena-tweets.pkl')
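
In the follow-up notebook the dataframe can be loaded back with `pd.read_pickle`. A minimal round-trip sketch, using a temporary directory and made-up rows in place of the real path:

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the combined tweet dataframe.
tweets = pd.DataFrame({
    "tweet_text": ["you go girl", "come back stronger"],
    "search query": ["naomi osaka", "serena williams"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "naomi-serena-tweets.pkl")
    tweets.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored dataframe should match the original exactly.
print(restored.equals(tweets))
```

Pickle files can execute code when loaded, so this only makes sense for files I wrote myself.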