# Data Wrangling

Now that I've got the data I need, I'll have to clean it up.

It looks like there will be two parts to the data cleaning process:

1. Text Cleansing
2. Text Preparation

## Text Cleansing
This will be general cleaning of the text to:
* remove special characters, hyperlinks, and mentions of other users
* convert all text to lowercase
* remove stop words
* replace abbreviations
* correct spelling

Here are some resources I can use to help with this part of the process:
* [Data Cleaning](https://towardsdatascience.com/another-twitter-sentiment-analysis-with-python-part-2-333514854913)
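
As a rough idea of what the first three cleansing steps might look like, here is a minimal sketch using only the standard library. The names `cleanse`, `URL_RE`, `MENTION_RE`, and the tiny `STOP_WORDS` set are made up for illustration (a real run would likely use a fuller stop-word list), and abbreviation replacement and spelling correction are not covered.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a complete list)
STOP_WORDS = {"a", "an", "the", "and", "or", "in", "on", "to", "of", "is", "you", "i"}

# Links, including Twitter picture links that lack a scheme
URL_RE = re.compile(r"(https?://\S+|pic\.twitter\.com/\S+)")
# Mentions; the scraped tweets have a space after "@", so allow one
MENTION_RE = re.compile(r"@\s?\w+")
# Anything that is not a lowercase ASCII letter, digit, or whitespace
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")

def cleanse(text):
    """Lowercase, strip links/mentions/special characters, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(cleanse("@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!"))
```

Lowercasing first keeps the later patterns simple, at the cost of losing case information that some sentiment tools use.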

## Text Preparation
This will prepare the text for analysis by:
* stemming
* lemmatization
* POS tagging

Here are some resources I can use to help with this part of the process:

* [Basic Data Engineering](https://medium.com/@SeoJaeDuk/basic-data-cleaning-engineering-session-twitter-sentiment-data-b9376a91109b)

Other useful resources:

* https://codereview.stackexchange.com/questions/163446/cleaning-and-extracting-meaningful-text-from-tweets
* https://www.kdnuggets.com/2017/11/framework-approaching-textual-data-tasks.html
* https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html
* https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html

In [1]:
import json
import sys
sys.path.append("data/")

In [2]:
file = "C:/Users/jzpow/Code/Projects/Naomi-Osaka/data/Naomi_Eng.csv"

In [3]:
import pandas as pd

In [4]:
pd.read_csv(file)

Unnamed: 0,id,tweet_id,tweet_date,tweet_text,tweet_loc
0,1,1038577721113161728,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,
1,2,1038577715278954496,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",
2,3,1038577714662461445,Sat Sep 08 19:59:57 +0000 2018,,
3,4,1038577711881613317,Sat Sep 08 19:59:57 +0000 2018,Ofusca a imagem da Serena apenas. Ela se perde...,
4,5,1038577711566872577,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！,
5,6,1038577708962332680,Sat Sep 08 19:59:56 +0000 2018,@ Naomi_Osaka_ probably felt like she was at h...,
6,7,1038577706953318405,Sat Sep 08 19:59:55 +0000 2018,"Congrats girly, don’t let anyone take this mom...",
7,8,1038577706257055744,Sat Sep 08 19:59:55 +0000 2018,Naomi Osaka defeats Serena Williams in a drama...,
8,9,1038577705774669825,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...,
9,10,1038577701676875776,Sat Sep 08 19:59:54 +0000 2018,Carlos Ramos also robbed Osaka. Imagine how mu...,


In [5]:
# read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv(file)

In [6]:
df.head()

Unnamed: 0,id,tweet_id,tweet_date,tweet_text,tweet_loc
0,1,1038577721113161728,Sat Sep 08 19:59:59 +0000 2018,Naomi Osaka upsets Serena Williams in controve...,
1,2,1038577715278954496,Sat Sep 08 19:59:57 +0000 2018,"@ Naomi_Osaka_ , you go girl! I got your back!...",
2,3,1038577714662461445,Sat Sep 08 19:59:57 +0000 2018,,
3,4,1038577711881613317,Sat Sep 08 19:59:57 +0000 2018,Ofusca a imagem da Serena apenas. Ela se perde...,
4,5,1038577711566872577,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！,


In [7]:
df.tail()

Unnamed: 0,id,tweet_id,tweet_date,tweet_text,tweet_loc
14995,14996,1038553806118707200,Sat Sep 08 18:24:57 +0000 2018,Class acts don't smash rackets then verbally a...,
14996,14997,1038553806085050368,Sat Sep 08 18:24:57 +0000 2018,なおみちゃん、全米オープン優勝おめでとう嬉しくて…涙が出ちゃったよ。,
14997,14998,1038553805959389184,Sat Sep 08 18:24:57 +0000 2018,¡Nace una Estrella! Naomi Osaka venció a Seren...,
14998,14999,1038553804801761281,Sat Sep 08 18:24:57 +0000 2018,"It was obvious that Williams loss, at least th...",
14999,15000,1038553804298436615,Sat Sep 08 18:24:56 +0000 2018,Once again the racist media is trying to deny ...,


In [8]:
df.index

RangeIndex(start=0, stop=15000, step=1)

In [9]:
df.columns

Index(['id', 'tweet_id', 'tweet_date', 'tweet_text', 'tweet_loc'], dtype='object')

In [10]:
df.values

array([[1, 1038577721113161728, 'Sat Sep 08 19:59:59 +0000 2018',
        'Naomi Osaka upsets Serena Williams in controversial US Open final - CNN # SmartNewshttps://edition.cnn.com/2018/09/08/sport/naomi-osaka-serena-williams-us-open-tennis-int-spt/index.html …',
        nan],
       [2, 1038577715278954496, 'Sat Sep 08 19:59:57 +0000 2018',
        '@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!',
        nan],
       [3, 1038577714662461445, 'Sat Sep 08 19:59:57 +0000 2018', nan,
        nan],
       ...,
       [14998, 1038553805959389184, 'Sat Sep 08 18:24:57 +0000 2018',
        '¡Nace una Estrella! Naomi Osaka venció a Serena Williams por 6-2 y 6-4 en la final del # USOpen y se convirtió en la primera japonesa en ganar un Gran Slam y en la campeona más joven del torneo desde Maria Sharapova (19 años) que se quedó con el trofeo en el 2006.pic.twitter.com/O1pnar1BC6',
        nan],
       [14999, 1038553804801761281, 'Sat Sep 08 18:24:57 +0000 2018',
     

In [11]:
df.describe()

Unnamed: 0,id,tweet_id,tweet_loc
count,15000.0,15000.0,0.0
mean,7500.5,1.038563e+18,
std,4330.271354,6931693000000.0,
min,1.0,1.038554e+18,
25%,3750.75,1.038557e+18,
50%,7500.5,1.038562e+18,
75%,11250.25,1.038569e+18,
max,15000.0,1.038578e+18,


## Step #0: Remove Unnecessary Columns

Ok, so just by doing this I can see that there was not a single value stored in the location column (`describe()` shows a count of 0 for `tweet_loc`). So as much as I would've liked to do the geographical analysis, I'm not sure it will be possible. Also, I believe each tweet gets a new tweet ID, even if it's a retweet, so I don't think I'll be needing that column after all. I should drop both columns from the data.

In [13]:
df = df.drop(columns=["tweet_loc", "tweet_id"])

KeyError: "['tweet_loc' 'tweet_id'] not found in axis"
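
A `KeyError` like the one above is what `drop` raises when the named columns are no longer in the frame, which typically happens when the cell is re-run after the columns were already dropped. Here is a minimal sketch of a version that is safe to re-run, using a small made-up frame (`demo` and its values are illustrative, not the real data):

```python
import pandas as pd

# Made-up stand-in for the real tweet data (values are illustrative only)
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "tweet_id": [101, 102, 103],
    "tweet_text": ["a", "b", None],
    "tweet_loc": [None, None, None],
})

# Columns in which every value is missing (tweet_loc in this toy frame)
empty_cols = demo.columns[demo.isna().all()].tolist()

# errors="ignore" skips labels that are already gone, so re-running is harmless
demo = demo.drop(columns=["tweet_loc", "tweet_id"], errors="ignore")
demo = demo.drop(columns=["tweet_loc", "tweet_id"], errors="ignore")  # second run is a no-op

print(empty_cols, list(demo.columns))
```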

## Step #1: Drop Duplicates and NaNs

I'll be dropping duplicates on `tweet_text` instead (that is, if it's the same exact tweet more than once, I don't want to see it).

In [14]:
df = df.drop_duplicates("tweet_text")

In [15]:
df.shape

(14398, 3)

In [16]:
df = df.dropna()

In [17]:
df.shape

(14397, 3)

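After deduplication only one row with no text remained, and `dropna` removed it. Since `drop_duplicates` treats missing values as duplicates of each other, the raw data may well have held more than one empty tweet.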
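
One detail worth keeping in mind here: `drop_duplicates` treats missing values as equal to one another, so several empty tweets collapse into a single row before `dropna` ever runs. A small sketch with made-up data (the `demo` frame is illustrative only):

```python
import pandas as pd

# Toy data: three rows with no text and one repeated tweet
demo = pd.DataFrame({"tweet_text": ["hello", None, "hello", None, None]})

n_missing_raw = int(demo["tweet_text"].isna().sum())

# Deduplicating keeps the first "hello" and the first missing value
deduped = demo.drop_duplicates("tweet_text")
n_missing_after_dedup = int(deduped["tweet_text"].isna().sum())

# dropna then removes the single remaining empty row
cleaned = deduped.dropna()

print(n_missing_raw, n_missing_after_dedup, len(cleaned))
```

Counting missing values before deduplicating gives the true number of empty tweets.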

## Step #2: Remove Non-English Tweets

This is going to be a bit tricky, but I want to try it as best I can. There are Python modules (such as `langdetect`, used below) that guess the language of a piece of text. I'll see if I can use one to drop all non-English tweets from the dataframe.

In [18]:
from langdetect import detect

In [19]:
eng_str = "This is an English string."
jp_str = "日本語のキーボード"

In [20]:
detect(eng_str)

'en'

In [21]:
detect(jp_str)

'ja'

In [22]:
def is_eng(s):
    lang = detect(s)
    return lang == 'en'

In [23]:
df["tweet_text"].values

array(['Naomi Osaka upsets Serena Williams in controversial US Open final - CNN # SmartNewshttps://edition.cnn.com/2018/09/08/sport/naomi-osaka-serena-williams-us-open-tennis-int-spt/index.html …',
       '@ Naomi_Osaka_ , you go girl! I got your back! Congrats on the US open!',
       'Ofusca a imagem da Serena apenas. Ela se perdeu e deu um vexame. A japonesa Naomi Osaka deu um passeio na quadra.',
       ...,
       '¡Nace una Estrella! Naomi Osaka venció a Serena Williams por 6-2 y 6-4 en la final del # USOpen y se convirtió en la primera japonesa en ganar un Gran Slam y en la campeona más joven del torneo desde Maria Sharapova (19 años) que se quedó con el trofeo en el 2006.pic.twitter.com/O1pnar1BC6',
       'It was obvious that Williams loss, at least the way she lost is the equivalent to Tyson biting the ear of Holyfield. Williams was losing to a better player (on this day) and was looking for a way out! Congrats though to Naomi Osaka and Japan!',
       'Once again the racist 

In [24]:
def my_detect():
    # Return a callable that forwards its argument to langdetect's detect()
    return lambda s: detect(s)

In [25]:
lang_detector = my_detect()

lang_detector(eng_str)

'en'

In [26]:
lang_detector(jp_str)

'ja'

In [27]:
type(df["tweet_text"])

pandas.core.series.Series

In [28]:
# figured out how to lowercase all strings
df['tweet_text'] = df['tweet_text'].str.lower()

In [29]:
df.head()

Unnamed: 0,id,tweet_date,tweet_text
0,1,Sat Sep 08 19:59:59 +0000 2018,naomi osaka upsets serena williams in controve...
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ naomi_osaka_ , you go girl! i got your back!..."
3,4,Sat Sep 08 19:59:57 +0000 2018,ofusca a imagem da serena apenas. ela se perde...
4,5,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！
5,6,Sat Sep 08 19:59:56 +0000 2018,@ naomi_osaka_ probably felt like she was at h...


In [30]:
df['tweet_text'].str.lang_detector

AttributeError: 'StringMethods' object has no attribute 'lang_detector'
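
The `.str` accessor only exposes pandas' built-in string methods, so a custom function has to go through `Series.apply` (or `map`) instead. A rough sketch follows. `fake_detect` is a made-up stand-in so the example runs without `langdetect`, and `safe_is_eng` is a hypothetical helper name; with the real library one would pass `detect` as the `detector` argument. Catching exceptions seems prudent because detection can fail on text with no usable letters.

```python
import pandas as pd

def fake_detect(text):
    # Stand-in for langdetect.detect (assumption): ASCII-only text counts as "en"
    if not text.strip():
        raise ValueError("no features in text")
    return "en" if text.isascii() else "other"

def safe_is_eng(text, detector=fake_detect):
    # True only when the detector answers "en"; failures count as non-English
    try:
        return detector(text) == "en"
    except Exception:
        return False

demo = pd.Series(["you go girl!", "日本語のキーボード", "   "], name="tweet_text")

# apply calls the function once per element and returns a boolean Series
mask = demo.apply(safe_is_eng)
english_only = demo[mask]

print(english_only.tolist())
```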

In [30]:
qz = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'], 
                   'B': ['a', 'a', 'b', 'c', 'c'], 
                   'C': ['a', 'a', 'b', 'b', 'c']})



In [31]:
qz

Unnamed: 0,A,B,C
0,a,a,a
1,b,a,a
2,b,b,b
3,c,c,b
4,c,c,c


In [32]:
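for col in qz:
    vc = qz[col].value_counts()
    # Values that occur only once in this column
    vals_to_remove = vc[vc <= 1].index.values
    # Assign through a single .loc call to avoid chained assignment
    qz.loc[qz[col].isin(vals_to_remove), col] = None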


In [33]:
qz

Unnamed: 0,A,B,C
0,,a,a
1,b,a,a
2,b,,b
3,c,c,b
4,c,c,


In [43]:
type(df["tweet_text"].values)

numpy.ndarray

In [31]:
test = df[:10]
test

Unnamed: 0,id,tweet_date,tweet_text
0,1,Sat Sep 08 19:59:59 +0000 2018,naomi osaka upsets serena williams in controve...
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ naomi_osaka_ , you go girl! i got your back!..."
3,4,Sat Sep 08 19:59:57 +0000 2018,ofusca a imagem da serena apenas. ela se perde...
4,5,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！
5,6,Sat Sep 08 19:59:56 +0000 2018,@ naomi_osaka_ probably felt like she was at h...
6,7,Sat Sep 08 19:59:55 +0000 2018,"congrats girly, don’t let anyone take this mom..."
7,8,Sat Sep 08 19:59:55 +0000 2018,naomi osaka defeats serena williams in a drama...
8,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...
9,10,Sat Sep 08 19:59:54 +0000 2018,carlos ramos also robbed osaka. imagine how mu...
10,11,Sat Sep 08 19:59:52 +0000 2018,yes bravo to @ bigsascha bajin and of course l...


In [38]:
for col in test:
    # Count how often each value appears in this column
    vc = test[col].value_counts()
    print(vc)

11    1
10    1
9     1
8     1
7     1
6     1
5     1
4     1
2     1
1     1
Name: id, dtype: int64
Sat Sep 08 19:59:55 +0000 2018    3
Sat Sep 08 19:59:57 +0000 2018    2
Sat Sep 08 19:59:56 +0000 2018    2
Sat Sep 08 19:59:59 +0000 2018    1
Sat Sep 08 19:59:54 +0000 2018    1
Sat Sep 08 19:59:52 +0000 2018    1
Name: tweet_date, dtype: int64
大阪ハンパないって！！！ そんなんできひんやん普通！！！                                                                                                                                                                   1
carlos ramos also robbed osaka. imagine how much better she would feel if she broke serena to go up 5-3 instead of being giving the game.                                                      1
naomi osaka upsets serena williams in controversial us open final - cnn # smartnewshttps://edition.cnn.com/2018/09/08/sport/naomi-osaka-serena-williams-us-open-tennis-int-spt/index.html …    1
yes bravo to @ bigsascha bajin and of course love as always (since she 

In [47]:
test["tweet_text"]

0     naomi osaka upsets serena williams in controve...
1     @ naomi_osaka_ , you go girl! i got your back!...
3     ofusca a imagem da serena apenas. ela se perde...
4                          大阪ハンパないって！！！ そんなんできひんやん普通！！！
5     @ naomi_osaka_ probably felt like she was at h...
6     congrats girly, don’t let anyone take this mom...
7     naomi osaka defeats serena williams in a drama...
8     https://twitter.com/juventino5555/status/10377...
9     carlos ramos also robbed osaka. imagine how mu...
10    yes bravo to @ bigsascha bajin and of course l...
Name: tweet_text, dtype: object

In [44]:
test.loc[test['tweet_text'].apply(lambda x: len(x) <= 100)]

Unnamed: 0,id,tweet_date,tweet_text
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ naomi_osaka_ , you go girl! i got your back!..."
4,5,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！
8,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...


In [68]:
japanese_letters = "[ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ゗゘゙゚゛゜ゝゞゟァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ・ーヽヾヿ一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乎乏乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿]+"
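
The character class above spells out hiragana and katakana in full but only the first stretch of kanji (it stops at 仿), so characters such as 大 and 阪 fall outside it. A more compact alternative, assuming Unicode block ranges are acceptable, might look like the sketch below. `JAPANESE_RE` and `contains_japanese` are illustrative names, and the ideograph range is shared with Chinese, so it would flag Chinese text as well.

```python
import re

# Unicode blocks: hiragana, katakana, CJK unified ideographs (kanji, shared with Chinese)
JAPANESE_RE = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]+")

def contains_japanese(text):
    """Return True if the text has at least one kana or CJK ideograph."""
    return JAPANESE_RE.search(text) is not None

print(contains_japanese("大阪ハンパないって！！！"))
print(contains_japanese("naomi osaka upsets serena williams"))
```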

In [69]:
test

Unnamed: 0,id,tweet_date,tweet_text
0,1,Sat Sep 08 19:59:59 +0000 2018,naomi osaka upsets serena williams in controve...
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ naomi_osaka_ , you go girl! i got your back!..."
3,4,Sat Sep 08 19:59:57 +0000 2018,ofusca a imagem da serena apenas. ela se perde...
4,5,Sat Sep 08 19:59:56 +0000 2018,大阪ハンパないって！！！ そんなんできひんやん普通！！！
5,6,Sat Sep 08 19:59:56 +0000 2018,@ naomi_osaka_ probably felt like she was at h...
6,7,Sat Sep 08 19:59:55 +0000 2018,"congrats girly, don’t let anyone take this mom..."
7,8,Sat Sep 08 19:59:55 +0000 2018,naomi osaka defeats serena williams in a drama...
8,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...
9,10,Sat Sep 08 19:59:54 +0000 2018,carlos ramos also robbed osaka. imagine how mu...
10,11,Sat Sep 08 19:59:52 +0000 2018,yes bravo to @ bigsascha bajin and of course l...


In [70]:
import re
japanese="こんにちは"
re.search(japanese_letters, japanese)

<_sre.SRE_Match object; span=(0, 5), match='こんにちは'>

In [71]:
for p in test["tweet_text"]:
    print(re.search(japanese_letters, p))

None
None
None
<_sre.SRE_Match object; span=(2, 9), match='ハンパないって'>
None
None
None
None
None
None


In [74]:
test.loc[test['tweet_text'].apply(lambda x: re.search(japanese_letters, x) is None)]

Unnamed: 0,id,tweet_date,tweet_text
0,1,Sat Sep 08 19:59:59 +0000 2018,naomi osaka upsets serena williams in controve...
1,2,Sat Sep 08 19:59:57 +0000 2018,"@ naomi_osaka_ , you go girl! i got your back!..."
3,4,Sat Sep 08 19:59:57 +0000 2018,ofusca a imagem da serena apenas. ela se perde...
5,6,Sat Sep 08 19:59:56 +0000 2018,@ naomi_osaka_ probably felt like she was at h...
6,7,Sat Sep 08 19:59:55 +0000 2018,"congrats girly, don’t let anyone take this mom..."
7,8,Sat Sep 08 19:59:55 +0000 2018,naomi osaka defeats serena williams in a drama...
8,9,Sat Sep 08 19:59:55 +0000 2018,https://twitter.com/juventino5555/status/10377...
9,10,Sat Sep 08 19:59:54 +0000 2018,carlos ramos also robbed osaka. imagine how mu...
10,11,Sat Sep 08 19:59:52 +0000 2018,yes bravo to @ bigsascha bajin and of course l...


Yes!!! I got it to work!

Okay, so instead of relying on a Python module that wasn't doing what I needed, I went back to basics: put a bunch of Japanese characters in a regex character class and use it to filter out the tweets that contain any of them. That won't get rid of every non-English tweet, of course (and the character class only includes part of the kanji range), but I think the same trick can at least rule out some other languages (like Portuguese) whose text contains diacritics.
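
For the diacritics idea, one possible approach (a sketch under my own assumptions, not something tested on the full data) is to decompose each string with `unicodedata` and look for combining marks. `has_diacritics` is a hypothetical helper name. One caveat worth checking: the Portuguese tweet shown earlier (row 3) happens to contain no accented letters, so this test alone would not catch it, and English loanwords like "café" would be flagged.

```python
import unicodedata

def has_diacritics(text):
    """Return True if any character carries a combining mark after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for accents, tildes, cedillas, etc.
    return any(unicodedata.combining(ch) for ch in decomposed)

print(has_diacritics("Naomi Osaka venció a Serena Williams"))
print(has_diacritics("congrats girly, don’t let anyone take this"))
```

Curly apostrophes and similar punctuation do not decompose into combining marks, so ordinary English tweets should pass through.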