# **Text Preprocessing**
## It consists of several preprocessing steps for cleaning and preparing data.
### Cleaning and preprocessing data is a crucial step in machine learning tasks. The better we can represent our data, the better we can expect model training and prediction to be.

In [None]:
# Load the core libraries
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string


from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup


pd.options.display.max_columns=None
pd.options.display.max_rows=None
pd.options.display.max_colwidth=None

# Download a sample dataset
nltk.download('twitter_samples')

# We will use the negative and positive tweet files, each containing 5000 tweets.
for name in twitter_samples.fileids():
    print(f' - {name}')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


 - negative_tweets.json
 - positive_tweets.json
 - tweets.20150430-223406.json


## Installing demoji to remove and replace emojis in text correctly

In [None]:
pip install demoji

Collecting demoji
  Downloading demoji-1.1.0-py3-none-any.whl (42 kB)
Installing collected packages: demoji
Successfully installed demoji-1.1.0


In [None]:
import demoji

In [None]:
# Load the negative tweets file and assign label 0 to the negatives
negative_tweets = twitter_samples.strings("negative_tweets.json")
df_neg = pd.DataFrame(negative_tweets, columns=['text'])
df_neg['label'] = 0

# Load the positive tweets file and assign label 1 to the positives
positive_tweets = twitter_samples.strings("positive_tweets.json")
df_pos = pd.DataFrame(positive_tweets, columns=['text'])
df_pos['label'] = 1

df = pd.concat([df_pos, df_neg]) # Combine both files
df = df.sample(frac=1).reset_index(drop=True) # Shuffle the rows so negative and positive tweets are mixed

In [None]:
print(f'The data consists of {df.shape[0]} rows and {df.shape[1]} columns')

The data consists of 10000 rows and 2 columns


In [None]:
df.head()  # first five rows

Unnamed: 0,text,label
0,Found someone I met long ago in Malta to come to @tomorrowland with me with the extra ticket I had :) #GTO #Tomorrowland #incall,1
1,@Sibulela_M only white things I have ngeze turn up :( for all white parties I've been to. Have nothing cocktaily and classy 😭😭😭 so stressed.,0
2,"#Haaretz #Israel :-( Syria continues to develop chemical weapons, officials tell WSJ: WSJ rep... http://t.co/3c5PRCHKqw #UniteBlue #Tcot",0
3,More money more money :-(,0
4,@1031Genfmsby Visit my blog http://t.co/UzOAqroWKx thanks :D,1


### Lowercasing
#### Lowercasing also helps to find duplicates: words that differ only in case are otherwise treated as separate words, which makes it harder to remove unnecessary words across all their case combinations.

In [None]:
df.text = df.text.str.lower()
df.head(5)

Unnamed: 0,text,label
0,found someone i met long ago in malta to come to @tomorrowland with me with the extra ticket i had :) #gto #tomorrowland #incall,1
1,@sibulela_m only white things i have ngeze turn up :( for all white parties i've been to. have nothing cocktaily and classy 😭😭😭 so stressed.,0
2,"#haaretz #israel :-( syria continues to develop chemical weapons, officials tell wsj: wsj rep... http://t.co/3c5prchkqw #uniteblue #tcot",0
3,more money more money :-(,0
4,@1031genfmsby visit my blog http://t.co/uzoaqrowkx thanks :d,1
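
To make the effect on token counts concrete, here is a minimal sketch that uses only the standard library. The sample sentence is invented for illustration and is not part of the tweet dataset.

```python
from collections import Counter

sample = "Money money MONEY makes the world go round"  # hypothetical sentence

# Without lowercasing, each case variant is counted as a distinct token.
raw_counts = Counter(sample.split())

# After lowercasing, the three variants collapse into a single token.
lower_counts = Counter(sample.lower().split())

print(raw_counts["money"], lower_counts["money"])
print(len(raw_counts), len(lower_counts))
```

The vocabulary shrinks because the three spellings of the same word are merged, which is what makes later steps such as stopword removal simpler.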


## Removal operations

#### URLs need to be kept only if we are analyzing a website. Otherwise they carry no useful information, so we can remove them from our text.

In [None]:
df.text = df.text.str.replace(r'https?://\S+|www\.\S+', '', regex=True)
df.head()

Unnamed: 0,text,label
0,found someone i met long ago in malta to come to @tomorrowland with me with the extra ticket i had :) #gto #tomorrowland #incall,1
1,@sibulela_m only white things i have ngeze turn up :( for all white parties i've been to. have nothing cocktaily and classy 😭😭😭 so stressed.,0
2,"#haaretz #israel :-( syria continues to develop chemical weapons, officials tell wsj: wsj rep... #uniteblue #tcot",0
3,more money more money :-(,0
4,@1031genfmsby visit my blog thanks :d,1
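
The same pattern can also be tried on plain strings outside pandas. The sketch below wraps it in a helper, `remove_urls` (a name chosen here for illustration), that also collapses the double space a deleted URL leaves behind.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_urls(text):
    """Delete URLs, then collapse the extra spaces they leave behind."""
    without_urls = URL_PATTERN.sub('', text)
    return ' '.join(without_urls.split())

print(remove_urls("visit my blog http://t.co/uzoaqrowkx thanks :d"))
print(remove_urls("see www.example.com now"))
```

Because `\S+` consumes everything up to the next whitespace, punctuation glued to the end of a URL is removed with it.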


#### Email addresses are common in customer feedback data and do not provide any useful information.

In [None]:
text = 'I have been trying to contact xyz via email to xyz@abc.co.in but there is no response.'
re.sub(r'\S+@\S+', '', text)

'I have been trying to contact xyz via email to  but there is no response.'
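
The pattern `\S+@\S+` is deliberately loose: it removes any run of non-space characters around an `@`, including trailing punctuation. As a hedged alternative, the sketch below uses a somewhat stricter pattern that requires a dotted domain. The pattern and the helper name `remove_emails` are assumptions made for illustration, not a full address validator.

```python
import re

LOOSE_EMAIL = re.compile(r'\S+@\S+')
# Assumed stricter pattern: local part, "@", then a domain with at least one dot.
STRICT_EMAIL = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')

def remove_emails(text):
    """Remove email-like tokens while leaving surrounding punctuation alone."""
    return STRICT_EMAIL.sub('', text)

sample = "contact xyz@abc.co.in."
print(repr(LOOSE_EMAIL.sub('', sample)))   # the final period is removed too
print(repr(remove_emails(sample)))         # the final period is kept
```

Which variant is preferable depends on the data; the loose one is simpler and is usually good enough when punctuation is stripped afterwards anyway.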

#### Dates can appear in many formats and can sometimes be hard to remove. They do not provide any useful information for predicting the labels.


In [None]:
text = "Today is 22/12/2020 and after two days on 24-12-2020 our vacation starts until 25th.09.2021"

# Remove date formats like: dd/mm/yy(yy), dd-mm-yy(yy), dd(st|nd|rd|th).mm.yy(yy)
re.sub(r'\d{1,2}(st|nd|rd|th)?[-./]\d{1,2}[-./]\d{2,4}', '', text)

'Today is  and after two days on  our vacation starts until '
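
One format the pattern above does not handle is a year-first date such as `2020-12-22`: the regex matches only the tail `20-12-22` and leaves `20` behind. A possible extension, sketched here under the assumption that year-first dates occur in the data, puts a four-digit-year alternative first. The helper name `remove_dates` is hypothetical.

```python
import re

DAY_FIRST = r'\d{1,2}(?:st|nd|rd|th)?[-./]\d{1,2}[-./]\d{2,4}'
YEAR_FIRST = r'\d{4}[-./]\d{1,2}[-./]\d{1,2}'

# The year-first alternative is tried first so it wins on yyyy-mm-dd input.
DATE_PATTERN = re.compile(YEAR_FIRST + '|' + DAY_FIRST)

def remove_dates(text):
    return DATE_PATTERN.sub('', text)

print(repr(re.sub(DAY_FIRST, '', "2020-12-22")))            # leaves a stray '20'
print(repr(remove_dates("released 2020-12-22 and 24-12-2020")))
```

Dates written with month names (for example "22 December 2020") would still need their own pattern.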

#### If we extract data from various websites, the data may also contain HTML tags. These tags carry no useful information and should be removed. They can be removed using regex or the BeautifulSoup library.

In [None]:
text = """
<title>Below is a dummy html code.</title>
<body>
    <p>All the html opening and closing brackets should be removed.</p>
    <a href="https://www.abc.com">Company Site</a>
</body>
"""
# Using regex
pattern = re.compile(r'<.*?>')
pattern.sub('', text)

'\nBelow is a dummy html code.\n\n    All the html opening and closing brackets should be removed.\n    Company Site\n\n'

In [None]:
text = """
<title>Below is a dummy html code.</title>
<body>
    <p>All the html opening and closing brackets should be removed.</p>
    <a href="https://www.abc.com">Company Site</a>
</body>
"""
# Using Beautiful Soup
def remove_html(text):
    clean_text = BeautifulSoup(text).get_text()
    return clean_text

remove_html(text)

'Below is a dummy html code.\n\nAll the html opening and closing brackets should be removed.\nCompany Site\n\n'
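
One difference between the two approaches is worth a small sketch: the regex only deletes tags, so character entities such as `&amp;` stay in the text. Assuming the regex route is used, the standard library's `html.unescape` can decode them afterwards. The helper name `strip_tags_and_entities` is made up for this example.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')

def strip_tags_and_entities(markup):
    """Remove tags with a regex, then decode entities such as &amp;."""
    return html.unescape(TAG_PATTERN.sub('', markup))

print(TAG_PATTERN.sub('', "<p>Fish &amp; chips</p>"))
print(strip_tags_and_entities("<p>Fish &amp; chips</p>"))
```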

#### Removing emojis

In [None]:
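# Reference: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b

def remove_emoji(text):
    # NOTE: the range 24C2-1F251 also covers CJK, Hangul and other non-emoji
    # characters, and 10000-10FFFF covers every supplementary-plane character.
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing through misc. symbols and arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters up to U+1F251 (very broad)
                               u"\U0001f926-\U0001f937"  # gestures (face palm to shrug)
                               u"\U00010000-\U0010ffff"  # all supplementary planes
                               u"\u2640-\u2642"  # female and male signs
                               u"\u2600-\u2B55"  # miscellaneous symbols up to U+2B55
                               u"\u200d"  # zero-width joiner
                               u"\u23cf"  # eject symbol
                               u"\u23e9"  # fast-forward symbol
                               u"\u231a"  # watch
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"  # wavy dash
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)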

##### Example

In [None]:
text = "game is on 🔥🔥. Hilarious😂"
remove_emoji(text)

'game is on . Hilarious'
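
The range `\U000024C2-\U0001F251` in the pattern above is wide enough to include CJK ideographs, so non-English text is deleted along with the emojis. For multilingual data, a narrower set of ranges may be safer. The sketch below is one such selection; it is an assumption made for illustration and is not an exhaustive emoji list.

```python
import re

# The single broad range from the pattern above, isolated to show its effect.
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]+")

# Narrower selection (assumed, not exhaustive).
NARROW_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d\ufe0f"           # zero-width joiner, variation selector-16
    "]+"
)

def remove_emoji_narrow(text):
    return NARROW_EMOJI_PATTERN.sub("", text)

print(repr(BROAD_RANGE.sub("", "你好")))           # the Chinese text disappears
print(repr(remove_emoji_narrow("你好 😂")))        # only the emoji is removed
print(repr(remove_emoji_narrow("game is on 🔥🔥. Hilarious😂")))
```

For the English-only tweet dataset used here the difference is unlikely to matter.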

In [None]:
# Remove emojis from the dataset
df.text = df.text.apply(remove_emoji)

#### Emoticons built from characters

In [None]:
EMOTICONS = {
    u":‑\)":"Happy face or smiley",
    u":\)":"Happy face or smiley",
    u":-\]":"Happy face or smiley",
    u":\]":"Happy face or smiley",
    u":-3":"Happy face smiley",
    u":3":"Happy face smiley",
    u":->":"Happy face smiley",
    u":>":"Happy face smiley",
    u"8-\)":"Happy face smiley",
    u":o\)":"Happy face smiley",
    u":-\}":"Happy face smiley",
    u":\}":"Happy face smiley",
    u":-\)":"Happy face smiley",
    u":c\)":"Happy face smiley",
    u":\^\)":"Happy face smiley",
    u"=\]":"Happy face smiley",
    u"=\)":"Happy face smiley",
    u":‑D":"Laughing, big grin or laugh with glasses",
    u":D":"Laughing, big grin or laugh with glasses",
    u"8‑D":"Laughing, big grin or laugh with glasses",
    u"8D":"Laughing, big grin or laugh with glasses",
    u"X‑D":"Laughing, big grin or laugh with glasses",
    u"XD":"Laughing, big grin or laugh with glasses",
    u"=D":"Laughing, big grin or laugh with glasses",
    u"=3":"Laughing, big grin or laugh with glasses",
    u"B\^D":"Laughing, big grin or laugh with glasses",
    u":-\)\)":"Very happy",
    u":‑\(":"Frown, sad, angry or pouting",
    u":-\(":"Frown, sad, angry or pouting",
    u":\(":"Frown, sad, angry or pouting",
    u":‑c":"Frown, sad, angry or pouting",
    u":c":"Frown, sad, angry or pouting",
    u":‑<":"Frown, sad, angry or pouting",
    u":<":"Frown, sad, angry or pouting",
    u":‑\[":"Frown, sad, angry or pouting",
    u":\[":"Frown, sad, angry or pouting",
    u":-\|\|":"Frown, sad, angry or pouting",
    u">:\[":"Frown, sad, angry or pouting",
    u":\{":"Frown, sad, angry or pouting",
    u":@":"Frown, sad, angry or pouting",
    u">:\(":"Frown, sad, angry or pouting",
    u":'‑\(":"Crying",
    u":'\(":"Crying",
    u":'‑\)":"Tears of happiness",
    u":'\)":"Tears of happiness",
    u"D‑':":"Horror",
    u"D:<":"Disgust",
    u"D:":"Sadness",
    u"D8":"Great dismay",
    u"D;":"Great dismay",
    u"D=":"Great dismay",
    u"DX":"Great dismay",
    u":‑O":"Surprise",
    u":O":"Surprise",
    u":‑o":"Surprise",
    u":o":"Surprise",
    u":-0":"Shock",
    u"8‑0":"Yawn",
    u">:O":"Yawn",
    u":-\*":"Kiss",
    u":\*":"Kiss",
    u":X":"Kiss",
    u";‑\)":"Wink or smirk",
    u";\)":"Wink or smirk",
    u"\*-\)":"Wink or smirk",
    u"\*\)":"Wink or smirk",
    u";‑\]":"Wink or smirk",
    u";\]":"Wink or smirk",
    u";\^\)":"Wink or smirk",
    u":‑,":"Wink or smirk",
    u";D":"Wink or smirk",
    u":‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"X‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"XP":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"d:":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"=p":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u">:P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":-[.]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":S":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":‑\|":"Straight face",
    u":\|":"Straight face",
    u":\$":"Embarrassed or blushing",
    u":‑x":"Sealed lips or wearing braces or tongue-tied",
    u":x":"Sealed lips or wearing braces or tongue-tied",
    u":‑#":"Sealed lips or wearing braces or tongue-tied",
    u":#":"Sealed lips or wearing braces or tongue-tied",
    u":‑&":"Sealed lips or wearing braces or tongue-tied",
    u":&":"Sealed lips or wearing braces or tongue-tied",
    u"O:‑\)":"Angel, saint or innocent",
    u"O:\)":"Angel, saint or innocent",
    u"0:‑3":"Angel, saint or innocent",
    u"0:3":"Angel, saint or innocent",
    u"0:‑\)":"Angel, saint or innocent",
    u"0:\)":"Angel, saint or innocent",
    u":‑b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"0;\^\)":"Angel, saint or innocent",
    u">:‑\)":"Evil or devilish",
    u">:\)":"Evil or devilish",
    u"\}:‑\)":"Evil or devilish",
    u"\}:\)":"Evil or devilish",
    u"3:‑\)":"Evil or devilish",
    u"3:\)":"Evil or devilish",
    u">;\)":"Evil or devilish",
    u"\|;‑\)":"Cool",
    u"\|‑O":"Bored",
    u":‑J":"Tongue-in-cheek",
    u"#‑\)":"Party all night",
    u"%‑\)":"Drunk or confused",
    u"%\)":"Drunk or confused",
    u":-###..":"Being sick",
    u":###..":"Being sick",
    u"<:‑\|":"Dump",
    u"\(>_<\)":"Troubled",
    u"\(>_<\)>":"Troubled",
    u"\(';'\)":"Baby",
    u"\(\^\^>``":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(\^_\^;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(~_~;\) \(・\.・;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-\)zzz":"Sleeping",
    u"\(\^_-\)":"Wink",
    u"\(\(\+_\+\)\)":"Confused",
    u"\(\+o\+\)":"Confused",
    u"\(o\|o\)":"Ultraman",
    u"\^_\^":"Joyful",
    u"\(\^_\^\)/":"Joyful",
    u"\(\^O\^\)／":"Joyful",
    u"\(\^o\^\)／":"Joyful",
    u"\(__\)":"Kowtow as a sign of respect, or dogeza for apology",
    u"_\(\._\.\)_":"Kowtow as a sign of respect, or dogeza for apology",
    u"<\(_ _\)>":"Kowtow as a sign of respect, or dogeza for apology",
    u"<m\(__\)m>":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(__\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(_ _\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"\('_'\)":"Sad or Crying",
    u"\(/_;\)":"Sad or Crying",
    u"\(T_T\) \(;_;\)":"Sad or Crying",
    u"\(;_;":"Sad or Crying",
    u"\(;_:\)":"Sad or Crying",
    u"\(;O;\)":"Sad or Crying",
    u"\(:_;\)":"Sad or Crying",
    u"\(ToT\)":"Sad or Crying",
    u";_;":"Sad or Crying",
    u";-;":"Sad or Crying",
    u";n;":"Sad or Crying",
    u";;":"Sad or Crying",
    u"Q\.Q":"Sad or Crying",
    u"T\.T":"Sad or Crying",
    u"QQ":"Sad or Crying",
    u"Q_Q":"Sad or Crying",
    u"\(-\.-\)":"Shame",
    u"\(-_-\)":"Shame",
    u"\(一一\)":"Shame",
    u"\(；一_一\)":"Shame",
    u"\(=_=\)":"Tired",
    u"\(=\^\·\^=\)":"cat",
    u"\(=\^\·\·\^=\)":"cat",
    u"=_\^=":"cat",
    u"\(\.\.\)":"Looking down",
    u"\(\._\.\)":"Looking down",
    u"\^m\^":"Giggling with hand covering mouth",
    u"\(・・\?":"Confusion",
    u"\(\?_\?\)":"Confusion",
    u">\^_\^<":"Normal Laugh",
    u"<\^!\^>":"Normal Laugh",
    u"\^/\^":"Normal Laugh",
    u"\（\*\^_\^\*）" :"Normal Laugh",
    u"\(\^<\^\) \(\^\.\^\)":"Normal Laugh",
    u"\(^\^\)":"Normal Laugh",
    u"\(\^\.\^\)":"Normal Laugh",
    u"\(\^_\^\.\)":"Normal Laugh",
    u"\(\^_\^\)":"Normal Laugh",
    u"\(\^\^\)":"Normal Laugh",
    u"\(\^J\^\)":"Normal Laugh",
    u"\(\*\^\.\^\*\)":"Normal Laugh",
    u"\(\^—\^\）":"Normal Laugh",
    u"\(#\^\.\^#\)":"Normal Laugh",
    u"\（\^—\^\）":"Waving",
    u"\(;_;\)/~~~":"Waving",
    u"\(\^\.\^\)/~~~":"Waving",
    u"\(-_-\)/~~~ \($\·\·\)/~~~":"Waving",
    u"\(T_T\)/~~~":"Waving",
    u"\(ToT\)/~~~":"Waving",
    u"\(\*\^0\^\*\)":"Excited",
    u"\(\*_\*\)":"Amazed",
    u"\(\*_\*;":"Amazed",
    u"\(\+_\+\) \(@_@\)":"Amazed",
    u"\(\*\^\^\)v":"Laughing,Cheerful",
    u"\(\^_\^\)v":"Laughing,Cheerful",
    u"\(\(d[-_-]b\)\)":"Headphones,Listening to music",
    u'\(-"-\)':"Worried",
    u"\(ーー;\)":"Worried",
    u"\(\^0_0\^\)":"Eyeglasses",
    u"\(\＾ｖ\＾\)":"Happy",
    u"\(\＾ｕ\＾\)":"Happy",
    u"\(\^\)o\(\^\)":"Happy",
    u"\(\^O\^\)":"Happy",
    u"\(\^o\^\)":"Happy",
    u"\)\^o\^\(":"Happy",
    u":O o_O":"Surprised",
    u"o_0":"Surprised",
    u"o\.O":"Surprised",
    u"\(o\.o\)":"Surprised",
    u"oO":"Surprised",
    u"\(\*￣m￣\)":"Dissatisfied",
    u"\(‘A`\)":"Snubbed or Deflated"
}

In [None]:
# Longest patterns first, so that ":-))" is not cut short by ":-)"; compiled once.
EMOTICONS_PATTERN = re.compile(u'(' + u'|'.join(sorted(EMOTICONS, key=len, reverse=True)) + u')')

def remove_emoticons(text):
    return EMOTICONS_PATTERN.sub(r'', text)

In [None]:
remove_emoticons("Hello :->")

'Hello '
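
Regex alternation tries its alternatives from left to right, so the order in which the emoticon patterns are joined matters when one emoticon is a prefix of another. The sketch below shows this with a hypothetical two-entry subset of the dictionary; the name `SMALL_EMOTICONS` is made up for the example.

```python
import re

# Hypothetical two-entry subset; keys are regex fragments, as in EMOTICONS.
SMALL_EMOTICONS = {
    r":-\)": "Happy face smiley",
    r":-\)\)": "Very happy",
}

# Dictionary order: the shorter pattern wins and a stray ")" is left behind.
in_order = re.compile("|".join(SMALL_EMOTICONS))

# Longest pattern first: the whole emoticon is consumed.
longest_first = re.compile(
    "|".join(sorted(SMALL_EMOTICONS, key=len, reverse=True))
)

print(repr(in_order.sub("", "great :-))")))
print(repr(longest_first.sub("", "great :-))")))
```

Sorting by the length of the pattern string is only an approximation of "longest emoticon first", since escape characters count towards the length, but it is usually sufficient.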

In [None]:
# Remove emoticons from the dataset
df.text = df.text.apply(remove_emoticons)

### Converting emojis to text

In [None]:
# demoji 1.x ships its emoji data with the package, so the deprecated
# demoji.download_codes() call is no longer needed.


In [None]:
def emoji_to_words(text):
    return demoji.replace_with_desc(text, sep="__")

In [None]:
text = "game is on 🔥 🚣🏼"
emoji_to_words(text)

'game is on __fire__ __person rowing boat: medium-light skin tone__'

In [None]:
def emoticons_to_words(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").replace(":","").split()), text)
    return text

In [None]:
text = "Hey there!! :-)"
emoticons_to_words(text)

'Hey there!! Happy_face_smiley'
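
The loop above runs one `re.sub` per dictionary entry. A possible variation, sketched here with a small hypothetical dictionary whose keys are literal emoticons rather than regex fragments, compiles a single pattern and looks up the label in a callback. `EMOTICON_LABELS` and `emoticons_to_words_single_pass` are names invented for this example.

```python
import re

# Hypothetical subset with literal (unescaped) emoticons as keys.
EMOTICON_LABELS = {
    ":-)": "Happy face smiley",
    ":-))": "Very happy",
    ":(": "Frown, sad, angry or pouting",
}

# re.escape handles the special characters; longest keys go first.
_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_LABELS, key=len, reverse=True))
)

def emoticons_to_words_single_pass(text):
    """Replace every emoticon in one pass with its underscore-joined label."""
    def to_label(match):
        label = EMOTICON_LABELS[match.group(0)]
        return "_".join(label.replace(",", "").replace(":", "").split())
    return _PATTERN.sub(to_label, text)

print(emoticons_to_words_single_pass("Hey there!! :-)"))
print(emoticons_to_words_single_pass("so tired :("))
```

A single pass also avoids the risk that a later pattern matches inside a label inserted by an earlier replacement.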

#### Hashtags and mentions are used in tweets to signal a topic or to draw a person's attention. Hashtags can be used to extract features, to see what is trending, and in various other applications. We remove them because we do not need them here.

In [None]:
def remove_tags_mentions(text):
    pattern = re.compile(r'(@\S+|#\S+)')
    return pattern.sub('', text)

In [None]:
text = "live @flippinginja on #younow - jonah and jareddddd"
remove_tags_mentions(text)

'live  on  - jonah and jareddddd'
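
If the word behind a hashtag should be kept as a feature, a different trade-off is possible. The sketch below is an alternative to the function above, not the approach used in this notebook: it drops mentions entirely but strips only the `#` sign. The function name is hypothetical.

```python
import re

def strip_mentions_keep_hashtag_words(text):
    """Drop @mentions entirely but keep the word behind each '#'."""
    text = re.sub(r'@\S+', '', text)
    return re.sub(r'#(\w+)', r'\1', text)

print(repr(strip_mentions_keep_hashtag_words(
    "live @flippinginja on #younow - jonah and jareddddd")))
```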

In [None]:
# Remove hashtags and mentions from the dataset
df.text = df.text.apply(remove_tags_mentions)

#### Punctuation marks are characters other than letters and digits. They include [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]

Before removing punctuation, some expressions may need to be removed or converted first. For example, if the text contains $10.50, removing the . (period) turns it into 1050 and the amount loses its meaning.

In [None]:
PUNCTUATIONS = string.punctuation

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', PUNCTUATIONS))

In [None]:
df.text = df["text"].apply(lambda text: remove_punctuation(text))
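
One way to handle the currency case mentioned above is to replace amounts with a placeholder token before stripping punctuation. The sketch below assumes dollar amounts only; the helper `protect_amounts` and the token `moneyamount` are hypothetical names. `remove_punctuation` is repeated so the block runs on its own.

```python
import re
import string

def protect_amounts(text, placeholder="moneyamount"):
    """Swap dollar amounts for a placeholder token (a hypothetical name)."""
    return re.sub(r'\$\s?\d+(?:[.,]\d+)?', placeholder, text)

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

raw = "it costs $10.50, really!"
print(remove_punctuation(raw))                    # the amount becomes 1050
print(remove_punctuation(protect_amounts(raw)))   # the amount becomes a token
```

Whether to keep a placeholder or drop amounts altogether depends on whether prices are informative for the task.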

#### Stopwords are the commonly used words of a language. In English, for example, these include 'the', 'a', 'an', and many more. In most cases they are not useful and need to be removed.

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in STOPWORDS])

In [None]:
df.text = df.text.apply(lambda text: remove_stopwords(text))
df.head()

Unnamed: 0,text,label
0,found someone met long ago malta come extra ticket,1
1,white things ngeze turn white parties ive nothing cocktaily classy stressed,0
2,syria continues develop chemical weapons officials tell wsj wsj rep,0
3,money money,0
4,visit blog thanks,1
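
The NLTK English stopword list includes negations such as "not" and "no", which can flip the polarity of a tweet in sentiment tasks. A possible refinement is to exclude them from the removal set. The sketch below uses a small hand-picked stopword set for illustration only, so that it runs without downloading anything; the names are hypothetical.

```python
# Small hand-picked set for illustration; the real NLTK list is much longer.
STOPWORDS_SAMPLE = {"i", "am", "the", "this", "is", "a", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stopwords_all(text):
    return ' '.join(w for w in text.split() if w not in STOPWORDS_SAMPLE)

def remove_stopwords_keep_negations(text):
    to_drop = STOPWORDS_SAMPLE - NEGATIONS
    return ' '.join(w for w in text.split() if w not in to_drop)

print(remove_stopwords_all("i am not happy"))
print(remove_stopwords_keep_negations("i am not happy"))
```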


#### Removing extra whitespace

In [None]:
def remove_whitespaces(text):
    return " ".join(text.split())

In [None]:
text = "  Whitespaces in the beginning are removed  \t as well \n  as in between  the text   "

clean_text = remove_whitespaces(text)
clean_text

'Whitespaces in the beginning are removed as well as in between the text'

In [None]:
df.text = df.text.apply(lambda x: remove_whitespaces(x))

### Stemming
#### In stemming, we reduce a word to its base or root form by stripping its affixes.




In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

In [None]:
df['text_stemmed'] = df.text.apply(lambda text: stem_words(text))
df[['text', 'text_stemmed']].head()

Unnamed: 0,text,text_stemmed
0,found someone met long ago malta come extra ticket,found someon met long ago malta come extra ticket
1,white things ngeze turn white parties ive nothing cocktaily classy stressed,white thing ngeze turn white parti ive noth cocktaili classi stress
2,syria continues develop chemical weapons officials tell wsj wsj rep,syria continu develop chemic weapon offici tell wsj wsj rep
3,money money,money money
4,visit blog thanks,visit blog thank
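
To illustrate the idea of suffix stripping, here is a deliberately naive stemmer. It is far simpler than the Porter algorithm used above and is only a sketch; the suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # assumed, checked in this order

def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["weapons", "stressed", "continues", "thanks", "money"]:
    print(word, "->", naive_stem(word))
```

For these particular words the result happens to agree with the Porter output in the table above, including the non-word "continu", which shows that a stem does not have to be a dictionary word.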


### Lemmatization
#### Lemmatization tries to do a job similar to stemming, but it takes the morphological analysis of the word into account. It tries to reduce words to their dictionary form, called the lemma.

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
lemmatizer = WordNetLemmatizer()

def text_lemmatize(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

In [None]:
df['text_lemmatized'] = df.text.apply(lambda text: text_lemmatize(text))
df[['text', 'text_stemmed', 'text_lemmatized']].head()

Unnamed: 0,text,text_stemmed,text_lemmatized
0,found someone met long ago malta come extra ticket,found someon met long ago malta come extra ticket,found someone met long ago malta come extra ticket
1,white things ngeze turn white parties ive nothing cocktaily classy stressed,white thing ngeze turn white parti ive noth cocktaili classi stress,white thing ngeze turn white party ive nothing cocktaily classy stressed
2,syria continues develop chemical weapons officials tell wsj wsj rep,syria continu develop chemic weapon offici tell wsj wsj rep,syria continues develop chemical weapon official tell wsj wsj rep
3,money money,money money,money money
4,visit blog thanks,visit blog thank,visit blog thanks
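
In the table above, "parties" becomes "party" but "continues" is left unchanged. This is because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed. The toy lookup below mimics that behaviour with a tiny hypothetical lemma table, so that it runs without the WordNet data; it is not how WordNet is implemented.

```python
# Tiny hypothetical lemma table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("continues", "v"): "continue",
    ("parties", "n"): "party",
    ("weapons", "n"): "weapon",
}

def lookup_lemma(word, pos="n"):
    """Return the dictionary form if the (word, pos) pair is known."""
    return LEMMA_TABLE.get((word, pos), word)

print(lookup_lemma("parties"))              # noun lookup succeeds
print(lookup_lemma("continues"))            # treated as a noun, unchanged
print(lookup_lemma("continues", pos="v"))   # verb lookup succeeds
```

With the real lemmatizer, passing a part-of-speech tag per word (for example from a POS tagger) is what allows verbs to be reduced as well.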


### Spelling correction


In [None]:
pip install pyspellchecker



In [None]:
from spellchecker import SpellChecker

In [None]:
spell = SpellChecker()

def correct_spelling(text):
    words = text.split()
    misspelled_words = spell.unknown(words)
    corrected = []
    for word in words:
        if word in misspelled_words:
            # In recent pyspellchecker versions correction() may return None; keep the word then
            corrected.append(spell.correction(word) or word)
        else:
            corrected.append(word)
    return " ".join(corrected)
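
Frequency-based spell checkers of this kind typically generate candidate words within a small edit distance and pick the most frequent one. The sketch below illustrates that idea with the standard library only. The word frequencies are invented, and it is not the actual pyspellchecker implementation.

```python
import string

# Hypothetical word frequencies; a real checker uses a large corpus.
WORD_FREQ = {"how": 120, "who": 80, "he": 300, "thanks": 40, "good": 90, "god": 30}

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

for word in ["thnks", "hwo", "hw", "god"]:
    print(word, "->", correct(word))
```

Two limits of the approach show up here: a short misspelling such as "hw" can resolve to a frequent but unintended word, and a typo that is itself a valid word, such as "god" for "good", is never flagged.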

In [None]:
text = "Hi, hwo are you doin? I'm good thnks for asking"
correct_spelling(text)

"Hi, how are you doing I'm good thanks for asking"

In [None]:
text = "hw are you doin? I'm god thnks"
correct_spelling(text)

"he are you doing I'm god thanks"