# TEXT PROCESSING

Text processing is a part of machine learning.

Author Name Disambiguation is one task within text processing. It is the process of distinguishing authors who share the same (ambiguous) name, which reduces errors when identifying author records.

**Libraries used**

In [1]:
# Standard library
import re
import string

# Third-party libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer, PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [11]:
df = pd.read_csv("robotics.csv")

**Viewing the dataset contents**

In [12]:
df

| | id | title | content | tags |
|---|---|---|---|---|
| 0 | 1 | What is the right approach to write the spin c... | `<p>Imagine programming a 3 wheel soccer robot....` | soccer control |
| 1 | 2 | How can I modify a low cost hobby servo to run... | `<p>I've got some hobby servos (<a href="http:/...` | control rcservo |
| 2 | 3 | What useful gaits exist for a six legged robot... | `<p><a href="http://www.oricomtech.com/projects...` | gait walk |
| 3 | 4 | Good Microcontrollers/SOCs for a Robotics Project | `<p>I am looking for a starting point for my pr...` | microcontroller arduino raspberry-pi |
| 4 | 5 | Nearest-neighbor data structure for non-Euclid... | `<p>I'm trying to implement a nearest-neighbor ...` | motion-planning rrt |
| ... | ... | ... | ... | ... |
| 2766 | 10568 | What types of actuators do these industrial bo... | `<p>I have a particular example robot that inte...` | motor robotic-arm actuator torque |
| 2767 | 10573 | Technique to increase POV resolution | `<p>I have thought of a technique to increase t...` | microcontroller electronics |
| 2768 | 10580 | How can I upload sketches to an Arduino over a... | `<p>I am doing robotics project on Raspberry pi...` | arduino raspberry-pi embedded-systems first-ro... |
| 2769 | 10581 | EKF SLAM 2d laser scanned datasets usage | `<p>How to understand the 2d laser scanner scan...` | slam ekf first-robotics |


**Dataset description**

1. id : identifier of the post.
2. title : title of the post.
3. content : body of the post, stored as HTML (for example `<p>` and `<a>` tags).
4. tags : space-separated tags assigned to the post (for example `soccer control`).
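The `content` and `tags` columns shown in the preview are not plain text: `content` holds HTML and `tags` holds several tags in one string. The following is a minimal sketch, using only the standard library, of how one row could be unpacked. The row values, the `TextExtractor` class, and the `strip_html` helper are hypothetical and are not part of the notebook.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(fragment):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)

# Hypothetical row shaped like the dataset (values are made up for illustration)
row = {
    "id": 1,
    "title": "Example title",
    "content": "<p>Some <a href=\"http://example.com\">linked</a> text.</p>",
    "tags": "control rcservo",
}

plain_content = strip_html(row["content"])
tag_list = row["tags"].split()

print(plain_content)  # Some linked text.
print(tag_list)       # ['control', 'rcservo']
```

The rest of this notebook works only with the `title` column, so this step would matter mainly if `content` were processed as well.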



In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2771 entries, 0 to 2770
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       2771 non-null   int64 
 1   title    2771 non-null   object
 2   content  2771 non-null   object
 3   tags     2771 non-null   object
dtypes: int64(1), object(3)
memory usage: 86.7+ KB


**Performing text preprocessing**

Text preprocessing is one part of Natural Language Processing (NLP) and is the earliest stage before the core data processing begins. Its purpose is to prepare the data so that it is more structured and easier to process. NLP is an Artificial Intelligence (AI) technology used to convert the essential content of a text document (text form) into quantitative, numerical data (numerical form) quickly, so that the data can be processed further, for example with classification, clustering, and so on.
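To make the idea of turning text into a numerical form concrete, here is a minimal bag-of-words sketch using only the standard library. The two example documents and the `to_counts` helper are hypothetical; the `TfidfVectorizer` imported in this notebook applies a weighted form of the same idea.

```python
from collections import Counter

# Hypothetical mini-corpus, already lowercased and stripped of punctuation
docs = [
    "swarm robotics communication",
    "robotics project with arduino",
]

# Build a sorted vocabulary over all documents
vocab = sorted({word for doc in docs for word in doc.split()})

def to_counts(doc):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_counts(doc) for doc in docs]

print(vocab)    # ['arduino', 'communication', 'project', 'robotics', 'swarm', 'with']
print(vectors)  # [[0, 1, 0, 1, 1, 0], [1, 0, 1, 1, 0, 1]]
```

Each document becomes a fixed-length list of numbers, which is the kind of input that classification or clustering algorithms expect.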


In [15]:
title = df['title']
print(title)

0       What is the right approach to write the spin c...
1       How can I modify a low cost hobby servo to run...
2       What useful gaits exist for a six legged robot...
3       Good Microcontrollers/SOCs for a Robotics Project
4       Nearest-neighbor data structure for non-Euclid...
                              ...                        
2766    What types of actuators do these industrial bo...
2767                 Technique to increase POV resolution
2768    How can I upload sketches to an Arduino over a...
2769             EKF SLAM 2d laser scanned datasets usage
2770                      Communication in SWARM robotics
Name: title, Length: 2771, dtype: object


**Text preprocessing function**

In [16]:
def clean(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation such as !@#$%&*()
    text = text.strip()  # Remove leading and trailing whitespace
    return text
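Because `clean` deletes punctuation and digits rather than replacing them with spaces, words joined by a slash or hyphen are merged into one token, and a term such as "2d" loses its digit. The sketch below reproduces `clean` on two titles from the dataset preview and compares it with `clean_spaced`, a hypothetical variant (not part of the notebook) that substitutes a space for each punctuation character.

```python
import re
import string

def clean(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip()

def clean_spaced(text):
    """Hypothetical variant: punctuation becomes a space instead of vanishing."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    # Map every punctuation character to a single space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    return re.sub(r'\s+', ' ', text).strip()  # collapse repeated spaces

sample = "Good Microcontrollers/SOCs for a Robotics Project"
print(clean(sample))         # good microcontrollerssocs for a robotics project
print(clean_spaced(sample))  # good microcontrollers socs for a robotics project
print(clean("EKF SLAM 2d laser scanned datasets usage"))
# ekf slam d laser scanned datasets usage
```

Which behaviour is preferable depends on the downstream task; the notebook keeps the deleting version.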

**Applying the function to one feature**

In [18]:
title = title.apply(clean)
title

0       what is the right approach to write the spin c...
1       how can i modify a low cost hobby servo to run...
2       what useful gaits exist for a six legged robot...
3        good microcontrollerssocs for a robotics project
4       nearestneighbor data structure for noneuclidea...
                              ...                        
2766    what types of actuators do these industrial bo...
2767                 technique to increase pov resolution
2768    how can i upload sketches to an arduino over a...
2769              ekf slam d laser scanned datasets usage
2770                      communication in swarm robotics
Name: title, Length: 2771, dtype: object

In [19]:
%pip install nltk




In [20]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

In [21]:
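def tokenstem(text):
    # Split the cleaned text into word tokens
    words = word_tokenize(text)
    filtered = []
    for w in words:
        # Keep only tokens that are not English stop words
        if w not in stop_words:
            w = lemmatizer.lemmatize(w)  # default part of speech is noun
            w = stemmer.stem(w)  # reduce the lemma to its Porter stem
            filtered.append(w)
    return filtered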

In [22]:
title = title.apply(tokenstem)
title

0       [right, approach, write, spin, control, soccer...
1          [modifi, low, cost, hobbi, servo, run, freeli]
2           [use, gait, exist, six, leg, robot, pro, con]
3             [good, microcontrollerssoc, robot, project]
4       [nearestneighbor, data, structur, noneuclidean...
                              ...                        
2766                   [type, actuat, industri, bot, use]
2767                    [techniqu, increas, pov, resolut]
2768             [upload, sketch, arduino, raspberri, pi]
2769              [ekf, slam, laser, scan, dataset, usag]
2770                               [commun, swarm, robot]
Name: title, Length: 2771, dtype: object
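Several tokens in the output above, such as `modifi`, `hobbi` and `commun`, are not dictionary words. That is the expected behaviour of the Porter stemmer, which strips suffixes by rule. A minimal sketch, assuming `nltk` is installed (the stemmer itself needs no downloaded corpus), applied to a few words taken from the titles:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the titles in this notebook
words = ["modify", "hobby", "communication", "robotics"]
stems = [stemmer.stem(w) for w in words]

print(stems)  # ['modifi', 'hobbi', 'commun', 'robot']
```

Stems like these are meant to group related word forms together; they are not intended to be read as words.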

**Explanation**

Lowercase = Converts uppercase letters to lowercase.

Regular Expression = Used here to remove digits.

Punctuation = Used to remove punctuation marks.

Strip = Removes whitespace at the beginning and end of the text.

Tokenize = Splits the text into individual word tokens.

Lemmatizer = Removes inflectional endings and returns a word to its base form (lemma); for example, runs, running, and ran are reduced to run. Note that `WordNetLemmatizer.lemmatize` treats words as nouns by default, so verb forms such as running and ran are only reduced when a verb part-of-speech tag (`pos='v'`) is passed.

Stemming = Reduces each word that survives stop-word filtering to its stem by stripping suffixes; the result is not always a dictionary word (for example, "modify" becomes "modifi").

Stop Words = Common words that occur very frequently and are considered to carry little meaning, such as 'the'.
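The tokenizing and stop-word steps can be sketched without NLTK as follows. This is only an illustration: `STOP_WORDS` is a hypothetical, very small list, and splitting on whitespace stands in for `word_tokenize`. The notebook itself uses NLTK's English list via `stopwords.words('english')`.

```python
# Hypothetical, very small stop-word list used only for this illustration
STOP_WORDS = {"what", "is", "the", "to", "for", "a", "in", "how", "can", "i"}

def tokenize_and_filter(text):
    """Split on whitespace, then drop tokens found in STOP_WORDS."""
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

result = tokenize_and_filter("communication in swarm robotics")
print(result)  # ['communication', 'swarm', 'robotics']
```

Using a `set` for the stop words keeps each membership test fast, which matters when the filter runs over thousands of titles.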