## From https://towardsdatascience.com/another-twitter-sentiment-analysis-with-python-part-2-333514854913

#### 1. The first issue I noticed is that, during the cleaning process, negation words are split into two parts, and the ‘t’ after the apostrophe vanishes when I keep only tokens longer than one character.
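
A minimal sketch of how that loss happens, assuming the earlier cleaner replaced every non-letter with a space and then dropped one-character tokens (`naive_clean` is a hypothetical name used only for this illustration):

```python
import re

def naive_clean(text):
    # Replace every non-letter, including the apostrophe, with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", text.lower())
    # Keep only tokens longer than one character.
    return " ".join(w for w in letters_only.split() if len(w) > 1)

print(naive_clean("I can't update my profile"))
# -> can update my profile
```

The stray "t" is filtered out, so the negation disappears and the meaning of the tweet flips.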

#### 2. The second issue I noticed is that some URLs don’t start with “http”; sometimes people paste links in the “www.websitename.com” form. Another problem with the previous regex pattern is that it only matches letters, digits, periods, and slashes. This means it fails to capture the rest of a URL that contains any other special character, such as “=”, “_”, or “~”.
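
The difference can be sketched as follows. `old_url_pat` is an assumption reconstructed from the description above (letters, digits, periods, slashes), and the sample tweet and domains are made up for illustration:

```python
import re

old_url_pat = r'https?://[A-Za-z0-9./]+'  # assumed form of the earlier pattern
new_url_pat = r'https?://[^ ]+'           # everything up to the next space
www_pat = r'www\.[^ ]+'                   # links pasted without "http"

tweet = "check http://example.com/watch?v=abc_1~2 and www.example.org/page now"

# The old pattern stops at "?" and leaves the query string behind.
old_result = re.sub(old_url_pat, '', tweet)
print(old_result)

# The new patterns remove both links completely.
new_result = re.sub(www_pat, '', re.sub(new_url_pat, '', tweet))
print(new_result)
```

The leftover double spaces in the second result are harmless here, because the final tokenize-and-join step collapses them.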

#### 3. The third issue is with the regex pattern for Twitter IDs. In the previous cleaning function I defined it as ‘@[A-Za-z0-9]+’, but with a little googling I found out that Twitter IDs also allow the underscore character.
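
A short illustration of the effect, using a made-up handle:

```python
import re

old_id_pat = r'@[A-Za-z0-9]+'   # previous pattern: stops at the underscore
new_id_pat = r'@[A-Za-z0-9_]+'  # updated pattern: underscore allowed

tweet = "@data_fan thanks for the tip"

old_result = re.sub(old_id_pat, '', tweet)
new_result = re.sub(new_id_pat, '', tweet)

print(repr(old_result))  # '_fan thanks for the tip'
print(repr(new_result))  # ' thanks for the tip'
```

With the old pattern, the tail of the handle ("_fan") survives and later becomes a spurious word in the cleaned text.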

## Below is the updated data-cleaning function. The order of the cleaning steps is:

###### 1. Souping
###### 2. BOM removal
###### 3. URL address ('http:' pattern) and Twitter ID removal
###### 4. URL address ('www.' pattern) removal   <-
###### 5. Lower-casing
###### 6. Negation handling   <-
###### 7. Removing numbers and special characters
###### 8. Tokenizing and joining   <-

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk.tokenize import WordPunctTokenizer
from bs4 import BeautifulSoup
import re

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

cols = ['sentiment','id','date','query_string','user','text']
df = pd.read_csv("./training.1600000.processed.noemoticon.csv",header=None, names=cols, encoding="ISO-8859-1")

In [4]:
tok = WordPunctTokenizer()
pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
pat3 = r'#[A-Za-z0-9]+'
combined_pat = r'|'.join((pat1, pat2, pat3))
www_pat = r'www\.[^ ]+'  # escape the dot so it matches a literal "." only
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}

# Besides negations_dic, contractions like the ones below also need to be caught.
# I am not sure what these are called; study the overall flow first and refine this later.
# _dic = {"it's":"it is", "they're":"they are"...}

neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')
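
As a quick check of how the compiled pattern behaves, here is a small sketch with a shortened dictionary (`negations` and the sample sentence are illustrative only). Because the keys are all lower-case, the text has to be lower-cased before substitution:

```python
import re

# Shortened dictionary for the example; the notebook uses the full negations_dic.
negations = {"can't": "can not", "won't": "will not", "isn't": "is not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations.keys()) + r')\b')

def expand_negations(text):
    # Swap each matched contraction for its expanded form.
    return neg_pattern.sub(lambda m: negations[m.group()], text)

print(expand_negations("i can't believe it isn't working"))
# -> i can not believe it is not working

# Without lower-casing first, a capitalised contraction is not matched.
print(expand_negations("Can't wait"))
# -> Can't wait
```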

In [16]:
for i in range(0,4):
    print(i)

0
1
2
3


In [17]:
def tweet_cleaner_updated(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    try:
        bom_removed = souped.replace(u"ï¿½", "?")
    except:
        bom_removed = souped
    stripped = re.sub(combined_pat, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    lower_case = stripped.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    
    number_recognizer = re.sub("[0-9]+", "#", neg_handled)
    letters_only = re.sub("[^a-zA-Z]", " ", number_recognizer)
    # The letters_only step above leaves unnecessary white space behind,
    # so tokenize and re-join the words to remove it.
    words = [x for x  in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()
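
To see the steps working together without the third-party dependencies, here is a simplified standard-library sketch. It is not the function above: `html.unescape` stands in for the BeautifulSoup step, `str.split` stands in for `WordPunctTokenizer` (only letters and spaces remain by that point), the BOM and hashtag steps are omitted, and the dictionary is shortened. `clean_sketch` and the sample tweet are hypothetical.

```python
import html
import re

mention_url_pat = r'@[A-Za-z0-9_]+|https?://[^ ]+'
www_pat = r'www\.[^ ]+'
negations = {"can't": "can not", "don't": "do not", "isn't": "is not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations) + r')\b')

def clean_sketch(text):
    text = html.unescape(text)                  # stand-in for souping
    text = re.sub(mention_url_pat, '', text)    # Twitter IDs and http(s) links
    text = re.sub(www_pat, '', text)            # www. links
    text = text.lower()
    text = neg_pattern.sub(lambda m: negations[m.group()], text)
    text = re.sub("[^a-z]", " ", text)          # drop numbers and special characters
    # Drop one-character tokens and collapse the extra white space.
    return " ".join(w for w in text.split() if len(w) > 1)

sample = "@some_user I can't open http://example.com/a?b=1 &amp; www.example.org today!!"
print(clean_sketch(sample))
# -> can not open today
```

A tweet consisting only of a Twitter ID or a link comes out as an empty string.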

In [19]:
nums = [0,400000,800000,1200000,1600000]
print( "Cleaning and parsing the tweets...\n")
clean_tweet_texts = []

for j in range(0, len(nums)-1):
    for i in range(nums[j],nums[j+1]):
        if( (i+1)%10000 == 0 ):
            print("Tweets %d of %d has been processed" % (i+1, nums[j+1]))
        clean_tweet_texts.append(tweet_cleaner_updated(df['text'][i]))
    clean_df = pd.DataFrame(clean_tweet_texts,columns=['text'])
    clean_df['target'] = df.sentiment
    clean_df.to_csv('clean_tweet_updated'+ str(j) +'.csv',encoding='utf-8')
    print("----------------------------------------------------")

Cleaning and parsing the tweets...

Tweets 10000 of 400000 has been processed
Tweets 20000 of 400000 has been processed
Tweets 30000 of 400000 has been processed
Tweets 40000 of 400000 has been processed
Tweets 50000 of 400000 has been processed
Tweets 60000 of 400000 has been processed
Tweets 70000 of 400000 has been processed
Tweets 80000 of 400000 has been processed
Tweets 90000 of 400000 has been processed
Tweets 100000 of 400000 has been processed
Tweets 110000 of 400000 has been processed
Tweets 120000 of 400000 has been processed
Tweets 130000 of 400000 has been processed
Tweets 140000 of 400000 has been processed
Tweets 150000 of 400000 has been processed
Tweets 160000 of 400000 has been processed
Tweets 170000 of 400000 has been processed
Tweets 180000 of 400000 has been processed
Tweets 190000 of 400000 has been processed
Tweets 200000 of 400000 has been processed
Tweets 210000 of 400000 has been processed
Tweets 220000 of 400000 has been processed
...
Tweets 770000 of 800000 has been processed
Tweets 780000 of 800000 has been processed
Tweets 790000 of 800000 has been processed
Tweets 800000 of 800000 has been processed
----------------------------------------------------
Tweets 810000 of 1200000 has been processed
Tweets 820000 of 1200000 has been processed
Tweets 830000 of 1200000 has been processed
Tweets 840000 of 1200000 has been processed
Tweets 850000 of 1200000 has been processed
Tweets 860000 of 1200000 has been processed
Tweets 870000 of 1200000 has been processed
Tweets 880000 of 1200000 has been processed
Tweets 890000 of 1200000 has been processed
Tweets 900000 of 1200000 has been processed
Tweets 910000 of 1200000 has been processed
Tweets 920000 of 1200000 has been processed
Tweets 930000 of 1200000 has been processed
Tweets 940000 of 1200000 has been processed
Tweets 950000 of 1200000 has been processed
Tweets 960000 of 1200000 has been processed
Tweets 970000 of 1200000 has been processed
...
Tweets 1300000 of 1600000 has been processed
Tweets 1310000 of 1600000 has been processed
Tweets 1320000 of 1600000 has been processed
Tweets 1330000 of 1600000 has been processed
Tweets 1340000 of 1600000 has been processed
Tweets 1350000 of 1600000 has been processed
Tweets 1360000 of 1600000 has been processed
Tweets 1370000 of 1600000 has been processed
Tweets 1380000 of 1600000 has been processed
Tweets 1390000 of 1600000 has been processed
Tweets 1400000 of 1600000 has been processed
Tweets 1410000 of 1600000 has been processed
Tweets 1420000 of 1600000 has been processed
Tweets 1430000 of 1600000 has been processed
Tweets 1440000 of 1600000 has been processed
Tweets 1450000 of 1600000 has been processed
Tweets 1460000 of 1600000 has been processed
Tweets 1470000 of 1600000 has been processed
Tweets 1480000 of 1600000 has been processed
Tweets 1490000 of 1600000 has been processed
Tweets 1500000 of 1600000 has been processed
Tweets 1510000 of 1600000 has been processed
...

In [22]:
csv ='./03_result/clean_tweet_updated1.csv'
my_df = pd.read_csv(csv,index_col = 0)
my_df.head()

                                                text  target
0  awww that bummer you shoulda got david carr of...       0
1  is upset that he can not update his facebook b...       0
2  dived many times for the ball managed to save ...       0
3     my whole body feels itchy and like its on fire       0
4  no it not behaving at all mad why am here beca...       0


In [23]:
my_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 800000 entries, 0 to 799999
Data columns (total 2 columns):
text      798153 non-null object
target    800000 non-null int64
dtypes: int64(1), object(1)
memory usage: 18.3+ MB


In [31]:
my_df[my_df.isnull().any(axis=1)].count()

text         0
target    1847
dtype: int64

### Looking at these entries in the original data, it seems the only text they contained was either a Twitter ID or a URL. Since this is information I decided to discard for the sentiment analysis anyway, I will drop these null rows and update the data frame.
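
One likely mechanism, sketched here on a tiny made-up frame rather than the real data: a tweet that cleans down to an empty string is written to the CSV as an empty field, and `pd.read_csv` reads empty fields back as NaN by default.

```python
import io
import pandas as pd

# Two toy rows: the second one was cleaned down to an empty string.
toy = pd.DataFrame({"text": ["can not sleep", ""], "target": [0, 4]})

buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
loaded = pd.read_csv(buf, index_col=0)

# The empty string has come back as a missing value.
null_rows = loaded[loaded.isnull().any(axis=1)]
print(null_rows.count())
```

`count()` reports non-null values per column, which is why the notebook output shows 0 for `text` and the number of affected rows under `target`.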

In [32]:
my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)
my_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 798153 entries, 0 to 798152
Data columns (total 2 columns):
text      798153 non-null object
target    798153 non-null int64
dtypes: int64(1), object(1)
memory usage: 12.2+ MB
