**Data crawling** is a data-collection technique in which data is gathered automatically by a program rather than entered by hand. One library that can be used for crawling is twint, a library for crawling data from Twitter. As preparation, first install jupyter-book, the tool used to build this content.

First, go to the webmining folder.

Then install twint from GitHub with the following code:

In [None]:
!pip install git+https://github.com/twintproject/twint.git

After that, also install nest_asyncio:

In [None]:
!pip install nest_asyncio

In [None]:
import twint
import nest_asyncio
nest_asyncio.apply()  # call once so nested event loops work inside a Jupyter notebook

Crawling with **twint** takes only a few lines of code. The code below runs a crawl with twint:

In [None]:
import twint
c = twint.Config()

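c.Since = '2022-09-18'      # start of the date range
c.Until = '2022-09-19'      # end of the date range
c.Search = 'ospek'          # search keyword
c.Store_csv = True          # write the results as CSV rather than plain text
c.Output = 'dataOspek.csv'  # output file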

twint.run.Search(c)

Once the data has been crawled, the next step is preprocessing, also called **data cleaning**. What is **data cleaning**? It is the process of removing URLs, emoji, backslashes, and similar noise from the data. The goal is for the data fed to the model to be plain text only, which makes it easier for the model to learn.
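
As a rough illustration of what cleaning does to a single tweet, the sketch below applies the same kinds of steps to one string using only the standard library. The sample tweet and the helper name `clean_tweet` are made up for illustration; the notebook's own cells work on the whole DataFrame column instead.

```python
import re
import string

def clean_tweet(text):
    """Hypothetical helper: lowercase a tweet and strip non-text noise."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)                    # URLs
    text = re.sub(r"[@#]\w+", " ", text)                    # mentions and hashtags
    text = text.encode("ascii", "ignore").decode("ascii")   # emoji and other non-ASCII
    text = re.sub(r"\d+", " ", text)                        # digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                           # collapse whitespace

sample = "Ospek offline minggu depan 😭 @kampus #ospek2022 https://t.co/abc123 !!!"
print(clean_tweet(sample))  # ospek offline minggu depan
```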

In [None]:
import pandas as pd
import numpy as np
data_tweet = pd.read_csv('dataospektweet.csv')  # read the crawled CSV; the name must match the file written by the crawl
data_tweet.head()

Next, replace the tweet at index 1 by hand to remove its emoji. The emoji there is invalid, so it cannot be removed programmatically.

In [None]:
# use .loc to avoid chained assignment, which may silently modify a copy
data_tweet.loc[1, "tweet"] = "Cuz i don't have a proper weekend...   Terus nyadar kalau minggu depan jg ada ospek offline"
data_tweet.loc[1, "tweet"]

**Case Folding**

In [None]:
data_tweet["tweet"] = data_tweet["tweet"].str.lower()
data_tweet["tweet"]

**Cleaning Data**

In [None]:
import nltk
import string
import re
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

The first step is to remove URLs

In [None]:
def remove_url(text):
  return re.sub(r'http\S+', '', text)
data_tweet["tweet"] = data_tweet["tweet"].apply(remove_url)

Next, remove special characters

In [None]:
def remove_tweet_special(text):
  # remove literal tab, newline, and unicode escape sequences, then any remaining backslashes
  text = text.replace('\\t'," ").replace('\\n'," ").replace('\\u'," ").replace('\\',"")
  # replace non-ASCII characters (emoticons, Chinese characters, etc.) with '?', which the punctuation step strips later
  text = text.encode('ascii','replace').decode('ascii')
  # remove mentions, hashtags, and links
  text = ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)", " ", text).split())
  # remove incomplete URLs
  return text.replace("http://"," ").replace("https://"," ")
data_tweet["tweet"] = data_tweet["tweet"].apply(remove_tweet_special)

In [None]:
def remove_number(text):
  return re.sub(r"\d+", "",text)
data_tweet["tweet"] = data_tweet["tweet"].apply(remove_number)

In [None]:
def remove_punctuation(text):
  return text.translate(str.maketrans("","",string.punctuation))
data_tweet["tweet"] = data_tweet["tweet"].apply(remove_punctuation)

In [None]:
def remove_whitespace(text):
  return text.strip()
data_tweet["tweet"] = data_tweet["tweet"].apply(remove_whitespace)

In [None]:
def remove_singl_char(text):
  return re.sub(r"\b[a-zA-Z]\b","",text)
data_tweet["tweet"] = data_tweet["tweet"].apply(remove_singl_char)

In [None]:
def word_tokenize_wrapper(text):
  return word_tokenize(text)
data_tweet["tweet_tokens"] = data_tweet["tweet"].apply(word_tokenize_wrapper)

In [None]:
data_tweet["tweet"].head()

In [None]:
data_tweet["tweet_tokens"].head()

In [None]:
def freqDist_wrapper(text):
  return FreqDist(text)

data_tweet["tweet_tokens_fdist"] = data_tweet["tweet_tokens"].apply(freqDist_wrapper)
data_tweet["tweet_tokens_fdist"].head()
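
`FreqDist` behaves like a word-count mapping (in NLTK it is a subclass of `collections.Counter`), so the idea can be sketched with the standard library alone. The token list below is a made-up example, not data from the crawl.

```python
from collections import Counter

tokens = ["ospek", "offline", "minggu", "depan", "ospek", "online", "ospek"]
freq = Counter(tokens)  # maps each token to its number of occurrences

print(freq["ospek"])        # 3
print(freq.most_common(1))  # [('ospek', 3)]
```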

**Stopword Removal**

In [None]:
# import nltk for text preprocessing
import nltk
nltk.download('stopwords')  # download the stopword corpus (includes Indonesian)
from nltk.corpus import stopwords
stopword_language = stopwords.words('indonesian')  # base stopword list
stopword_language.extend(['jg','jos','yg','dg','dgn','rt','ny','d','klo','kalo',
                          'amp','biar','bikin','bilang','gak','ga','krn','nya','nih',
                          'sih','si','tau','tdk','tuh','utk','ya','jd','jgn','sdh','aja',
                          'n','t',
                          ])
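# NOTE: this reads the tweet CSV again; a separate file holding a space-separated stopword list was presumably intended
txt_stopword = pd.read_csv('dataospektweet.csv',names=['stopwords'],header=None)
stopword_language.extend(txt_stopword['stopwords'][0].split(' '))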

In [None]:
def stopword_removal(words):
  return [word for word in words if word not in stopword_language]

data_tweet["tweet_tokens_rstopword"] = data_tweet["tweet_tokens"].apply(stopword_removal)
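
Membership tests against a Python list scan the whole list, so with a long stopword list it can help to convert it to a set first. The sketch below shows the filtering step on a small, hypothetical stopword set; `remove_stopwords` is an illustrative name, not a function from the notebook.

```python
# Hypothetical mini stopword set; the notebook builds its own from NLTK plus slang terms.
stopword_set = {"yang", "di", "dan", "yg", "jg", "ada"}

def remove_stopwords(tokens, stopwords=stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

tokens = ["minggu", "depan", "jg", "ada", "ospek", "offline"]
print(remove_stopwords(tokens))  # ['minggu', 'depan', 'ospek', 'offline']
```

In the notebook, the same effect could be obtained with `set(stopword_language)` before calling `stopword_removal`.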

**Normalization**

In [None]:
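# NOTE: this reads the tweet CSV again; a two-column file (raw term, normalized term) was presumably intended
normalized_word = pd.read_csv('dataospektweet.csv')
normalized_word_dict = {}

# build the lookup: first column is the raw term, second column its normalized form
for index, row in normalized_word.iterrows():
  if row.iloc[0] not in normalized_word_dict:
    normalized_word_dict[row.iloc[0]] = row.iloc[1]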

def normalized_term(document):
  return [normalized_word_dict[term] if term in normalized_word_dict else term for term in document]

data_tweet["tweet_normalized"] = data_tweet["tweet_tokens_rstopword"].apply(normalized_term)
data_tweet["tweet_normalized"].head()
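
The normalization step is a dictionary lookup per token: a term that has an entry is replaced, and any other term is kept unchanged. A minimal sketch, assuming a hypothetical slang-to-standard mapping (the entries below are illustrative and not taken from the notebook's data):

```python
# Hypothetical slang-to-standard mapping, for illustration only.
slang_map = {"gk": "tidak", "bgt": "banget", "udh": "sudah"}

def normalize_tokens(tokens, mapping=slang_map):
    """Replace each token by its standard form when the mapping has one."""
    return [mapping.get(tok, tok) for tok in tokens]

print(normalize_tokens(["ospek", "udh", "selesai", "capek", "bgt"]))
# ['ospek', 'sudah', 'selesai', 'capek', 'banget']
```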

**Stemming**

In [None]:
!pip install sastrawi

In [None]:
!pip install swifter

In [None]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import swifter

factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemmed_wrapper(term):
  return stemmer.stem(term)

term_dict = {}

for document in data_tweet['tweet_normalized']:
  for term in document:
    if term not in term_dict:
      term_dict[term] = ' '

print(len(term_dict))
print("------------------------")

for term in term_dict:
  term_dict[term] = stemmed_wrapper(term)
  print(term,':',term_dict[term])

print(term_dict)
print("------------------------")

In [None]:
def get_stemmed_term(document):
  return [term_dict[term] for term in document]

data_tweet["tweet_tokens_stemmed"] = data_tweet["tweet_normalized"].swifter.apply(get_stemmed_term)
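
The `term_dict` above exists so that each distinct term is stemmed only once, since stemming is the slow part. The same idea can be sketched with `functools.lru_cache`. The stemmer here is a toy stand-in that only strips a trailing "nya" (an assumption for illustration, not how Sastrawi works); in the notebook the body would call `stemmer.stem(term)` instead.

```python
from functools import lru_cache

calls = 0  # counts how often the stemmer body actually runs

@lru_cache(maxsize=None)
def cached_stem(term):
    """Toy stand-in stemmer (hypothetical): strips a trailing 'nya' from longer words."""
    global calls
    calls += 1
    return term[:-3] if term.endswith("nya") and len(term) > 5 else term

documents = [["ospeknya", "seru"], ["ospeknya", "capek"], ["seru", "capek"]]
stemmed = [[cached_stem(term) for term in doc] for doc in documents]

print(stemmed)  # [['ospek', 'seru'], ['ospek', 'capek'], ['seru', 'capek']]
print(calls)    # 3, one call per distinct term
```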

**Building the VSM**

In [None]:
data_tweet.to_csv("preprocessing_ospek.csv")

In [None]:
!pip install scikit-learn

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
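# read_csv no longer accepts squeeze=True in pandas 2.x, so squeeze the single column into a Series afterwards
titles = pd.read_csv('preprocessing_ospek_sample.csv',sep=',',usecols=["tweet_tokens_stemmed"]).squeeze("columns")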

docs = titles.values
bag = count.fit_transform(docs)
print(docs)

In [None]:
print(count.vocabulary_)

In [None]:
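# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
print(count.get_feature_names_out())
a = count.get_feature_names_out()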

In [None]:
print(bag.toarray())
b = bag.toarray()
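
To make the bag-of-words matrix less of a black box, the sketch below builds a vocabulary and a document-term matrix by hand for three made-up documents. A plain whitespace split stands in for the tokenizer here, so this only approximates what `CountVectorizer` does (for example, its default settings drop single-character tokens).

```python
from collections import Counter

docs = ["ospek offline minggu depan", "ospek online", "minggu depan libur"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct term, sorted alphabetically and mapped to a column index.
vocabulary = {term: idx for idx, term in enumerate(sorted({t for doc in tokenized for t in doc}))}

# Document-term matrix: one row per document, one column per vocabulary term.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts.get(term, 0) for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the vector representation of one document, which is what the later feature-selection step consumes.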

In [None]:
dfa = pd.DataFrame(data=a)

# one row per tweet, one column per vocabulary term;
# pass the names as columns=a (columns=[a] would create MultiIndex columns)
dfb = pd.DataFrame(data=b, columns=a)
dfb

In [None]:
dflabels = pd.read_csv('labels.csv')
dfb['labels'] = dflabels  # add the labels column from labels.csv
dfb

In [None]:
# import the train/test split helper from sklearn's model_selection module
from sklearn.model_selection import train_test_split
# import mutual information for classification targets
from sklearn.feature_selection import mutual_info_classif
# split the data into a training set and a test set
X_train,X_test,y_train,y_test=train_test_split(dfb.drop(labels=['labels'], axis=1),
    dfb['labels'],
    test_size=0.3,
    random_state=0)
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info
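
Mutual information measures how much knowing a feature reduces uncertainty about the label. The sketch below computes the textbook plug-in estimate for two discrete sequences on made-up data. It is meant only to convey the idea: scikit-learn's `mutual_info_classif` uses a different estimator (for dense input it defaults to a nearest-neighbour method), so its values need not match these.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Plug-in estimate of I(X; Y) in nats for two equally long discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi

labels      = ["pos", "pos", "neg", "neg"]
informative = [1, 1, 0, 0]  # term appears only in "pos" tweets
useless     = [1, 0, 1, 0]  # term appears equally often in both classes

print(mutual_information(informative, labels))  # log(2), about 0.693
print(mutual_information(useless, labels))      # 0.0
```

A term whose presence tracks the label gets a high score, while a term spread evenly across classes scores zero, which is why sorting by this value ranks features by usefulness.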

**Sort**

In [None]:
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)