# Sentiment Analysis
This notebook walks through several stages of sentiment analysis:
1. Preparation (loading and exploring the dataset)
2. Data cleaning and pre-processing
3. Feature extraction <br>
(Parts of this notebook are adapted from https://towardsdatascience.com/another-twitter-sentiment-analysis-bb5b01ebad90)


## 1. Preparation and a first look at the data
We will load the dataset and explore it. <br>
First download the dataset from cs.stanford.edu/people/alecmgo/trainingandtestdata.zip , then extract it. <br>
The archive contains two files: a training file and a test file.

In [72]:
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
cols = ['sentiment','id','date','query_string','user','text']
df_training = pd.read_csv(r"D:\Phyton\BigData\trainingandtestdata\training.1600000.processed.noemoticon.csv",header=None, names=cols, encoding="ISO-8859-1")
# above line will be different depending on where you saved your data, and your file name
df_training.head()

Unnamed: 0,sentiment,id,date,query_string,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [73]:
df_training.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
sentiment       1600000 non-null int64
id              1600000 non-null int64
date            1600000 non-null object
query_string    1600000 non-null object
user            1600000 non-null object
text            1600000 non-null object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [74]:
df_training.sentiment.value_counts()

4    800000
0    800000
Name: sentiment, dtype: int64

The data consists of 1.6 million tweets split evenly into two classes: class 4 (positive sentiment) and class 0 (negative sentiment). <br>
For convenience, we map class 4 to class 1.

In [75]:
df_training['sentiment'] = df_training['sentiment'].map({0: 0, 4: 1})

In [76]:
df_training.sentiment.value_counts()

1    800000
0    800000
Name: sentiment, dtype: int64

## 2. Data Cleaning and Pre-Processing
Before using the data, we first need to clean and pre-process it. <br>
For this dataset, we apply six cleaning steps:
1. Decoding HTML entities
2. Removing @mentions
3. Removing URL links
4. Decoding the text to UTF-8
5. Removing hashtag symbols and other non-letter characters
6. Removing stopwords
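To make the steps concrete, here is a minimal sketch that applies steps 1, 2, 3 and 5 to a single made-up tweet using only the Python standard library. The sample text, the `clean_sketch` helper and its regular expressions are assumptions for illustration, `html.unescape` stands in for BeautifulSoup, and stopword removal is left out because it needs the NLTK corpus.

```python
import html
import re

# Hypothetical sample tweet, made up for illustration
sample = "@some_user I &lt;3 this song!! http://example.com/abc #NowPlaying"

# Patterns assumed for this sketch: @mentions and http(s) URLs
MENTION_OR_URL = re.compile(r"@\w+|https?://\S+")


def clean_sketch(text):
    """Apply HTML decoding, mention/URL removal and letter-only filtering."""
    decoded = html.unescape(text)                        # step 1: "&lt;" becomes "<"
    stripped = MENTION_OR_URL.sub("", decoded)           # steps 2 and 3
    letters_only = re.sub(r"[^a-zA-Z]", " ", stripped)   # step 5: drops '#', digits, punctuation
    return " ".join(letters_only.lower().split())        # lower-case and collapse whitespace


print(clean_sketch(sample))  # prints: i this song nowplaying
```

Note that only the `#` symbol is removed; the hashtag text itself (`nowplaying`) stays in the cleaned tweet.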

In [77]:
from bs4 import BeautifulSoup
import re
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
tok = WordPunctTokenizer()
pat1 = r'@[A-Za-z0-9_]+'  # Twitter handles may contain underscores
pat2 = r'https?://[A-Za-z0-9./]+'
combined_pat = r'|'.join((pat1, pat2))
stop_words = set(stopwords.words('english')) 

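def tweet_cleaner(text):
    # Decode HTML entities (e.g. &amp;) and drop any markup
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    # Remove @mentions and URLs
    stripped = re.sub(combined_pat, '', souped)
    # Python 3 strings are already Unicode, so no .decode() call is needed;
    # only the U+FFFD replacement character is substituted
    clean = stripped.replace("\ufffd", "?")
    # Keep letters only (this also drops '#', digits and punctuation)
    letters_only = re.sub("[^a-zA-Z]", " ", clean)
    lower_case = letters_only.lower()
    # Tokenize, then drop English stopwords
    words = tok.tokenize(lower_case)
    filtered_words = [w for w in words if w not in stop_words]
    return " ".join(filtered_words).strip()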

In [85]:
%%time
print("Cleaning and parsing the tweets...\n")
# Iterate over the column directly instead of indexing row by row
clean_tweet_texts = [tweet_cleaner(text) for text in df_training['text']]
print("Done!")

Cleaning and parsing the tweets...





Done!
Wall time: 16min 12s


Now let's check what the data looks like after the cleaning step.

In [79]:
clean_df = pd.DataFrame(clean_tweet_texts,columns=['text'])
clean_df['target'] = df_training.sentiment
clean_df.head()

Unnamed: 0,text,target
0,,0
1,,0
2,,0
3,,0
4,,0


Next, we save the cleaned data to the file **clean_tweet_training.csv**.

In [80]:
clean_df.to_csv('clean_tweet_training.csv',encoding='utf-8')

In [81]:
csv = 'clean_tweet_training.csv'
my_df = pd.read_csv(csv,index_col=0)
my_df.head()



Unnamed: 0,text,target
0,,0
1,,0
2,,0
3,,0
4,,0


Cleaning and pre-processing can leave some entries *null* (empty). <br>
We will not use those empty entries.
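One way such nulls can arise is the CSV round trip: a tweet that is reduced to an empty string is written as an empty field, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below shows this on a made-up three-row frame; the `toy` data and the in-memory buffer are assumptions for illustration, not part of the original pipeline.

```python
import io

import pandas as pd

# Toy frame: the second tweet became an empty string after cleaning
toy = pd.DataFrame({"text": ["good morning", "", "nice day"],
                    "target": [1, 0, 1]})

# Round-trip through CSV, using an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# The empty string comes back as NaN
print(reloaded["text"].isnull().sum())

# Drop the rows whose text is missing, keeping text and target aligned
kept = reloaded.dropna(subset=["text"])
print(len(kept))
```

Dropping rows with `dropna(subset=["text"])` keeps the text and target columns aligned in one step, which is an alternative to filtering each column separately.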

In [82]:
tweet_text = my_df['text']
target = my_df['target']
fixed_text = tweet_text[pd.notnull(tweet_text)]
fixed_target = target[pd.notnull(tweet_text)]

## 3. Feature Extraction
Extract bag-of-words features from the cleaned dataset.
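As a rough illustration of what a bag-of-words representation is, the sketch below builds the count matrix by hand for a made-up toy corpus. The corpus and the `bag_of_words` helper are assumptions for illustration. It splits on whitespace only, which is simpler than the tokenisation `CountVectorizer` performs, so it is not a drop-in replacement.

```python
from collections import Counter

# Toy corpus, made up for illustration
docs = ["good morning world", "good night", "morning coffee good good"]

# Vocabulary: every distinct token, sorted so the column order is deterministic
vocabulary = sorted({token for doc in docs for token in doc.split()})
# Term -> column index, similar in spirit to CountVectorizer's vocabulary_
term_to_index = {term: i for i, term in enumerate(vocabulary)}


def bag_of_words(doc):
    """Return the count vector of `doc` over the shared vocabulary."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(doc.split()).items():
        vector[term_to_index[token]] = count
    return vector


matrix = [bag_of_words(doc) for doc in docs]
print(vocabulary)  # ['coffee', 'good', 'morning', 'night', 'world']
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term. Word order is discarded, which is why the representation is called a "bag" of words.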

In [83]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(fixed_text)
X


ValueError: empty vocabulary; perhaps the documents only contain stop words

In [None]:
transformed = vectorizer.transform(["See what’s happening in the world right now"])
print (transformed)

In [None]:
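vocab = vectorizer.vocabulary_
# Invert the term -> column-index mapping so terms can be looked up by index
index_to_term = {index: term for term, index in vocab.items()}
for v in transformed.indices:
    print(index_to_term[v])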

### Now perform feature extraction using TfidfVectorizer!
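As a hint, here is a rough plain-Python sketch of the weighting that, as far as we know, `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The toy corpus and the `tfidf_vector` helper are assumptions for illustration, and the sketch uses whitespace tokenisation as a simplification.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration
docs = ["good morning world", "good night", "morning coffee good good"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

# Smoothed inverse document frequency (assumed to match the library default)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_vector(tokens):
    """Return the L2-normalised tf-idf vector of one tokenised document."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights


tfidf = [tfidf_vector(tokens) for tokens in tokenized]
for row in tfidf:
    print([round(w, 3) for w in row])
```

A term that appears in every document, such as "good" here, gets the smallest idf, so rarer terms such as "night" carry more weight within a document.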
### Then repeat steps 1-3 for the test data!