# Topic: "Text Preprocessing with Python"
We preprocess Twitter data so that the cleaned text can later be used for a classification task.
The dataset contains negative (label = 1) and neutral (label = 0) tweets.
The assignment suggests combining train_df and test_df; here only the training file is loaded.

In [None]:
import re
import numpy as np 
import pandas as pd
from pathlib import Path
import string

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

import nltk
from nltk import tokenize as tknz
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
DATA_ROOT = Path('/content/drive/Othercomputers/Мое устройство Компьютер/Google.Disk/Colab Notebooks/data/')
TRAIN_PATH = DATA_ROOT / 'train_tweets.csv'
#TEST_PATH = DATA_ROOT / 'test_tweets.csv'

In [None]:
df_train = pd.read_csv(TRAIN_PATH, sep=',')
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [None]:
df_train.head(3)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty


Tasks:
1. Remove @user from all tweets using the pattern "@[\w]*".

To do this, we write a function:
- to find all occurrences of the pattern in the text, use re.findall(pattern, input_txt)
- to replace @user with a space, use re.sub()
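
The two bullets above can be sketched as a small helper. This is only an illustration: the name remove_pattern and its return value (cleaned text plus the list of matches) are assumptions, not something the notebook defines. re.findall is used to inspect what will be removed, and re.sub performs the replacement.

```python
import re

def remove_pattern(input_txt, pattern):
    """Return the text with every match of `pattern` replaced by a space,
    together with the list of matches that were found."""
    # re.findall lists every non-overlapping match, handy for inspection
    found = re.findall(pattern, input_txt)
    # re.sub replaces all of those matches in a single pass
    cleaned = re.sub(pattern, ' ', input_txt)
    return cleaned, found

cleaned, found = remove_pattern("@user @user thanks for #lyft credit", r'@[\w]*')
print(found)            # ['@user', '@user']
print(cleaned.split())  # ['thanks', 'for', '#lyft', 'credit']
```

Replacing with the pattern itself, rather than with each found string one by one, avoids accidentally cutting a longer mention that starts with a shorter one.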


In [None]:
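# Replace every @mention with a space; Series.apply avoids DataFrame.applymap, which is deprecated since pandas 2.1
df_train['tweet'] = df_train['tweet'].apply(lambda x: re.sub(r'@[\w]*', ' ', x))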
df_train[['tweet']].head()

Unnamed: 0,tweet
0,when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,bihday your majesty
3,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,factsguide: society now #motivation


2. Convert the tweets to lowercase with .lower().

In [None]:
# Lowercase every tweet with the vectorized string method
df_train['tweet'] = df_train['tweet'].str.lower()
df_train[['tweet']].head()

Unnamed: 0,tweet
0,when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,bihday your majesty
3,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,factsguide: society now #motivation


3. Replace contractions with apostrophes (e.g. ain't, can't) with their full forms, using
apostrophe_dict.

To do this, write a function:
- for each word in the text (for word in text.split()), check whether the word is a key of apostrophe_dict (the contracted form); if it is, replace the key with its value (the full form).

4. Replace abbreviations with their full forms, using short_word_dict. This reuses
the function from the previous step.

5. Replace emoticons (e.g. ":)" = "happy") with their word equivalents, using emoticon_dict.
This reuses the function from the previous step.
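
Steps 3 to 5 have no code cell in this notebook, and the three dictionaries are not shown, so the following is only a minimal sketch of the lookup function they describe. The dictionary contents and the function name replace_from_dict are hypothetical and deliberately tiny.

```python
# Hypothetical, deliberately small lookup tables (assumptions for illustration;
# the real dictionaries named in the tasks are not part of this notebook).
apostrophe_dict = {"can't": "can not", "don't": "do not", "ain't": "am not"}
short_word_dict = {"u": "you", "ur": "your", "thx": "thanks"}
emoticon_dict = {":)": "happy", ":(": "sad"}

def replace_from_dict(text, mapping):
    """Replace each whitespace-separated word that is a key of `mapping`
    with the corresponding value; other words are kept unchanged."""
    return ' '.join(mapping.get(word, word) for word in text.split())

sample = "i can't believe u did it :)"
for mapping in (apostrophe_dict, short_word_dict, emoticon_dict):
    sample = replace_from_dict(sample, mapping)

print(sample)  # i can not believe you did it happy
```

These replacements have to run before step 6: apostrophes and emoticons consist of punctuation, which the pattern r'[^\w\s]' turns into spaces, after which the dictionary keys no longer match.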

6. Replace punctuation with spaces using re.sub() and the pattern r'[^\w\s]'.

In [None]:
# Replace every character that is neither a word character nor whitespace with a space
df_train['tweet'] = df_train['tweet'].apply(lambda x: re.sub(r'[^\w\s]', ' ', x))
df_train[['tweet']].head()

Unnamed: 0,tweet
0,when a father is dysfunctional and is so selfish he drags his kids into his dysfunction run
1,thanks for lyft credit i can t use cause they don t offer wheelchair vans in pdx disapointed getthanked
2,bihday your majesty
3,model i love u take with u all the time in urð ð ð ð ð ð ð ð
4,factsguide society now motivation


7. Replace special characters with spaces using re.sub() and the pattern r'[^a-zA-Z0-9]'.

In [None]:
# Replace everything except ASCII letters and digits with a space
df_train['tweet'] = df_train['tweet'].apply(lambda x: re.sub(r'[^a-zA-Z0-9]', ' ', x))
df_train[['tweet']].head()

Unnamed: 0,tweet
0,when a father is dysfunctional and is so selfish he drags his kids into his dysfunction run
1,thanks for lyft credit i can t use cause they don t offer wheelchair vans in pdx disapointed getthanked
2,bihday your majesty
3,model i love u take with u all the time in ur
4,factsguide society now motivation


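8. Replace digits with spaces using re.sub() and the pattern r'[^a-zA-Z]'. This pattern matches every character that is not an ASCII letter, so digits are removed along with anything else still left in the text.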

In [None]:
# Replace everything except ASCII letters (including digits) with a space
df_train['tweet'] = df_train['tweet'].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))
df_train[['tweet']].head()

Unnamed: 0,tweet
0,when a father is dysfunctional and is so selfish he drags his kids into his dysfunction run
1,thanks for lyft credit i can t use cause they don t offer wheelchair vans in pdx disapointed getthanked
2,bihday your majesty
3,model i love u take with u all the time in ur
4,factsguide society now motivation
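
The three patterns from steps 6 to 8 are nested: each one removes everything the previous one removed, plus more. The sketch below, run on a made-up sample string (an assumption, not a row from the dataset), suggests that the three passes give the same result as a single pass with the step 8 pattern.

```python
import re

# Made-up sample containing punctuation, a digit, an underscore and a non-ASCII letter
sample = "I can't use 2 vans_in pdx!!! caf\u00e9 #run"

step6 = re.sub(r'[^\w\s]', ' ', sample)       # punctuation -> space
step7 = re.sub(r'[^a-zA-Z0-9]', ' ', step6)   # non-ASCII letters, underscores -> space
step8 = re.sub(r'[^a-zA-Z]', ' ', step7)      # digits -> space

# One pass with the most restrictive pattern
single_pass = re.sub(r'[^a-zA-Z]', ' ', sample)

print(step8 == single_pass)  # True
```

Each substitution swaps one character for one space, so the string length never changes and the order of the passes does not affect which characters survive.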


9. Remove one-character words from the text using ' '.join([w for w in x.split() if len(w)>1]).

In [None]:
# Keep only words longer than one character; split()/join() also collapses repeated spaces
df_train['tweet'] = df_train['tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 1]))
df_train[['tweet']].head()

Unnamed: 0,tweet
0,when father is dysfunctional and is so selfish he drags his kids into his dysfunction run
1,thanks for lyft credit can use cause they don offer wheelchair vans in pdx disapointed getthanked
2,bihday your majesty
3,model love take with all the time in ur
4,factsguide society now motivation
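
For reuse on new data (for example, a test set), steps 1, 2 and 6 to 9 can be collected into one function. This is a sketch rather than the notebook's own code: the name clean_tweet is an assumption, and steps 6 to 8 are folded into a single substitution because the step 8 pattern already covers what steps 6 and 7 remove.

```python
import re

def clean_tweet(text):
    """Apply steps 1, 2 and 6 to 9 to a single tweet."""
    text = re.sub(r'@[\w]*', ' ', text)      # step 1: drop @mentions
    text = text.lower()                      # step 2: lowercase
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # steps 6 to 8: keep ASCII letters only
    # step 9: drop one-character words (also collapses repeated spaces)
    return ' '.join(w for w in text.split() if len(w) > 1)

raw = ("@user @user thanks for #lyft credit i can't use cause they don't "
       "offer wheelchair vans in pdx. #disapointed #getthanked")
print(clean_tweet(raw))
# thanks for lyft credit can use cause they don offer wheelchair vans in pdx disapointed getthanked
```

With pandas, such a function could be applied to the whole column as df_train['tweet'].apply(clean_tweet).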


10. Split the tweets into tokens with nltk.tokenize.word_tokenize, storing them in a new
column 'tweet_token'.

In [None]:
# Tokenize each cleaned tweet into a list of words
df_train['tweet_token'] = df_train['tweet'].apply(tknz.word_tokenize)
df_train[['tweet', 'tweet_token']].head()

Unnamed: 0,tweet,tweet_token
0,when father is dysfunctional and is so selfish he drags his kids into his dysfunction run,"[when, father, is, dysfunctional, and, is, so, selfish, he, drags, his, kids, into, his, dysfunction, run]"
1,thanks for lyft credit can use cause they don offer wheelchair vans in pdx disapointed getthanked,"[thanks, for, lyft, credit, can, use, cause, they, don, offer, wheelchair, vans, in, pdx, disapointed, getthanked]"
2,bihday your majesty,"[bihday, your, majesty]"
3,model love take with all the time in ur,"[model, love, take, with, all, the, time, in, ur]"
4,factsguide society now motivation,"[factsguide, society, now, motivation]"


11. Remove stop words from the tokens using nltk.corpus.stopwords. Create a column
'tweet_token_filtered' without stop words.

In [None]:
# A set gives constant-time membership checks, unlike the list returned by stopwords.words()
stop_words = set(stopwords.words('english'))

df_train['tweet_token_filtered'] = df_train['tweet_token'].apply(lambda x: [w for w in x if w not in stop_words])

df_train[['tweet', 'tweet_token', 'tweet_token_filtered']].head()

Unnamed: 0,tweet,tweet_token,tweet_token_filtered
0,when father is dysfunctional and is so selfish he drags his kids into his dysfunction run,"[when, father, is, dysfunctional, and, is, so, selfish, he, drags, his, kids, into, his, dysfunction, run]","[father, dysfunctional, selfish, drags, kids, dysfunction, run]"
1,thanks for lyft credit can use cause they don offer wheelchair vans in pdx disapointed getthanked,"[thanks, for, lyft, credit, can, use, cause, they, don, offer, wheelchair, vans, in, pdx, disapointed, getthanked]","[thanks, lyft, credit, use, cause, offer, wheelchair, vans, pdx, disapointed, getthanked]"
2,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]"
3,model love take with all the time in ur,"[model, love, take, with, all, the, time, in, ur]","[model, love, take, time, ur]"
4,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"


12. Apply stemming to the tokens with nltk.stem.PorterStemmer. Create a column
'tweet_stemmed' with the stemmed tokens.

In [None]:
stemmer = PorterStemmer()
# Reduce each filtered token to its Porter stem
df_train['tweet_stemmed'] = df_train['tweet_token_filtered'].apply(lambda x: [stemmer.stem(w) for w in x])

df_train[['tweet', 'tweet_token', 'tweet_token_filtered', 'tweet_stemmed']].head()

Unnamed: 0,tweet,tweet_token,tweet_token_filtered,tweet_stemmed
0,when father is dysfunctional and is so selfish he drags his kids into his dysfunction run,"[when, father, is, dysfunctional, and, is, so, selfish, he, drags, his, kids, into, his, dysfunction, run]","[father, dysfunctional, selfish, drags, kids, dysfunction, run]","[father, dysfunct, selfish, drag, kid, dysfunct, run]"
1,thanks for lyft credit can use cause they don offer wheelchair vans in pdx disapointed getthanked,"[thanks, for, lyft, credit, can, use, cause, they, don, offer, wheelchair, vans, in, pdx, disapointed, getthanked]","[thanks, lyft, credit, use, cause, offer, wheelchair, vans, pdx, disapointed, getthanked]","[thank, lyft, credit, use, caus, offer, wheelchair, van, pdx, disapoint, getthank]"
2,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesti]"
3,model love take with all the time in ur,"[model, love, take, with, all, the, time, in, ur]","[model, love, take, time, ur]","[model, love, take, time, ur]"
4,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguid, societi, motiv]"


13. Apply lemmatization to the tokens with nltk.stem.wordnet.WordNetLemmatizer.
Create a column 'tweet_lemmatized' with the lemmatized tokens.

In [None]:
lemmatizer = WordNetLemmatizer()
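# Every token is lemmatized as a verb (wordnet.VERB), so words that are not verb forms may stay unchanged
df_train['tweet_lemmatized'] = df_train['tweet_token_filtered'].apply(lambda x: [lemmatizer.lemmatize(w, wordnet.VERB) for w in x])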

df_train[['tweet', 'tweet_token', 'tweet_token_filtered', 'tweet_stemmed', 'tweet_lemmatized']].head()

Unnamed: 0,tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,when father is dysfunctional and is so selfish he drags his kids into his dysfunction run,"[when, father, is, dysfunctional, and, is, so, selfish, he, drags, his, kids, into, his, dysfunction, run]","[father, dysfunctional, selfish, drags, kids, dysfunction, run]","[father, dysfunct, selfish, drag, kid, dysfunct, run]","[father, dysfunctional, selfish, drag, kid, dysfunction, run]"
1,thanks for lyft credit can use cause they don offer wheelchair vans in pdx disapointed getthanked,"[thanks, for, lyft, credit, can, use, cause, they, don, offer, wheelchair, vans, in, pdx, disapointed, getthanked]","[thanks, lyft, credit, use, cause, offer, wheelchair, vans, pdx, disapointed, getthanked]","[thank, lyft, credit, use, caus, offer, wheelchair, van, pdx, disapoint, getthank]","[thank, lyft, credit, use, cause, offer, wheelchair, vans, pdx, disapointed, getthanked]"
2,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesti]","[bihday, majesty]"
3,model love take with all the time in ur,"[model, love, take, with, all, the, time, in, ur]","[model, love, take, time, ur]","[model, love, take, time, ur]","[model, love, take, time, ur]"
4,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguid, societi, motiv]","[factsguide, society, motivation]"


14. Save the preprocessing result to a pickle file.

In [None]:
# Pickle keeps the token columns as Python lists; a CSV would store them as plain strings
df_train.to_pickle(DATA_ROOT / 'train_tweets_token.pkl')