# Text Processing
### Author: Ehsan Gharib-Nezhad


<!-- Let's review some of the pre-processing steps for text data:

- Remove special characters
- Tokenizing
- Lemmatizing/Stemming
- Stop word removal

`CountVectorizer` actually can do a lot of this for us! It is important to keep these steps in mind in case you want to change the default methods used for each of these. -->

In [1]:
# Load Libraries
from myfunctions import *
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup  # used to strip HTML tags
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer


In [2]:
# Load datasets
df = pd.read_csv('../datasets/preprocessed_df_PandemicPreps_reddit_LAST.csv',index_col=0)

In [3]:
df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,It's going fast in TN,"Anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! I took my under 10 aged child to Walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. Toilet paper gone. Paper towels gone. Lysol, down to 4 bottles. Hand sanitizer and alcohol, gone. Bleach was available but only a few rem...",PandemicPreps,1583021338,BeautifulBalance1,18,1,True,2020-02-29 16:08:58
1,Don’t forget about your pets!,"Amazon is running low on cat and dog food, my normals are sold out. I feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.",PandemicPreps,1583021470,JKMSDE,12,1,True,2020-02-29 16:11:10
2,DIY Hand Sanitizer,Because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/) \n\n* 5 c 91% isopropyl alcohol\n* 2 c aloe vera gel\n\nHere's another receipt with essential oils\n\n [https://www.asiaone.com/lifestyle/make-your-own-diy-hand-sanitizer](https://www....,PandemicPreps,1583022495,AccidentalDragon,44,1,True,2020-02-29 16:28:15
3,Anyone notice anything in the Denver / Boulder area? I see no concern and it concerns me!,What the hell is going on. Anyone around this area notice anything?,PandemicPreps,1583023089,Wizard_Knife_Fight,12,1,True,2020-02-29 16:38:09
4,"PSA: don’t only buy a bunch of random foods, know how to cook it and get the proper nutrition.","I see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). Just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. Have a good balance of foods and learn the nutrients balance. Vitamin supplements are very important because fresh produce can...",PandemicPreps,1583025573,cutting-alumination,56,1,True,2020-02-29 17:19:33


### Data shape

In [4]:
df.shape

(2362, 9)

### Drop rows with selftext equal to '[removed]'

In [5]:
# percentage of rows whose selftext is "[removed]"
removed_pct = np.round((df['selftext'] == '[removed]').mean() * 100, 2)
print(f"percentage of rows with '[removed]' word: {removed_pct}%")

percentage of rows with '[removed]' word: 0.0%


In [6]:
# remove all rows with selftext = "[removed]"
df.drop(index=df[df['selftext']=='[removed]'].index, inplace=True)

In [7]:
df.reset_index(inplace=True, drop=True)

In [8]:
df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,It's going fast in TN,"Anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! I took my under 10 aged child to Walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. Toilet paper gone. Paper towels gone. Lysol, down to 4 bottles. Hand sanitizer and alcohol, gone. Bleach was available but only a few rem...",PandemicPreps,1583021338,BeautifulBalance1,18,1,True,2020-02-29 16:08:58
1,Don’t forget about your pets!,"Amazon is running low on cat and dog food, my normals are sold out. I feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.",PandemicPreps,1583021470,JKMSDE,12,1,True,2020-02-29 16:11:10
2,DIY Hand Sanitizer,Because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/) \n\n* 5 c 91% isopropyl alcohol\n* 2 c aloe vera gel\n\nHere's another receipt with essential oils\n\n [https://www.asiaone.com/lifestyle/make-your-own-diy-hand-sanitizer](https://www....,PandemicPreps,1583022495,AccidentalDragon,44,1,True,2020-02-29 16:28:15
3,Anyone notice anything in the Denver / Boulder area? I see no concern and it concerns me!,What the hell is going on. Anyone around this area notice anything?,PandemicPreps,1583023089,Wizard_Knife_Fight,12,1,True,2020-02-29 16:38:09
4,"PSA: don’t only buy a bunch of random foods, know how to cook it and get the proper nutrition.","I see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). Just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. Have a good balance of foods and learn the nutrients balance. Vitamin supplements are very important because fresh produce can...",PandemicPreps,1583025573,cutting-alumination,56,1,True,2020-02-29 17:19:33


### Drop rows with NaN values

In [9]:
# null percentage
df.isnull().sum()*100/len(df)

title           0.0
selftext        0.0
subreddit       0.0
created_utc     0.0
author          0.0
num_comments    0.0
score           0.0
is_self         0.0
timestamp       0.0
dtype: float64

In [10]:
#drop all rows with nulls
df.dropna(inplace=True)

In [11]:
# resetting the index
df.reset_index(inplace=True, drop = True)

In [12]:
# check for any remaining nulls
df.isna().sum()

title           0
selftext        0
subreddit       0
created_utc     0
author          0
num_comments    0
score           0
is_self         0
timestamp       0
dtype: int64

In [13]:
df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,It's going fast in TN,"Anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! I took my under 10 aged child to Walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. Toilet paper gone. Paper towels gone. Lysol, down to 4 bottles. Hand sanitizer and alcohol, gone. Bleach was available but only a few rem...",PandemicPreps,1583021338,BeautifulBalance1,18,1,True,2020-02-29 16:08:58
1,Don’t forget about your pets!,"Amazon is running low on cat and dog food, my normals are sold out. I feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.",PandemicPreps,1583021470,JKMSDE,12,1,True,2020-02-29 16:11:10
2,DIY Hand Sanitizer,Because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/) \n\n* 5 c 91% isopropyl alcohol\n* 2 c aloe vera gel\n\nHere's another receipt with essential oils\n\n [https://www.asiaone.com/lifestyle/make-your-own-diy-hand-sanitizer](https://www....,PandemicPreps,1583022495,AccidentalDragon,44,1,True,2020-02-29 16:28:15
3,Anyone notice anything in the Denver / Boulder area? I see no concern and it concerns me!,What the hell is going on. Anyone around this area notice anything?,PandemicPreps,1583023089,Wizard_Knife_Fight,12,1,True,2020-02-29 16:38:09
4,"PSA: don’t only buy a bunch of random foods, know how to cook it and get the proper nutrition.","I see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). Just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. Have a good balance of foods and learn the nutrients balance. Vitamin supplements are very important because fresh produce can...",PandemicPreps,1583025573,cutting-alumination,56,1,True,2020-02-29 17:19:33


In [14]:
df.shape

(2362, 9)

In [15]:
df['timestamp']

0       2020-02-29 16:08:58
1       2020-02-29 16:11:10
2       2020-02-29 16:28:15
3       2020-02-29 16:38:09
4       2020-02-29 17:19:33
               ...         
2357    2020-02-29 14:58:17
2358    2020-02-29 15:46:42
2359    2020-02-29 15:55:08
2360    2020-02-29 15:56:39
2361    2020-02-29 15:57:46
Name: timestamp, Length: 2362, dtype: object

### Lower Casing

In [16]:
df['post']  = df['selftext'].str.lower()

In [17]:
df['post']

0       anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't!  i took my under 10 aged child to walmart today and picked up some more to add, and it is a night and day difference from 3 days ago.  toilet paper gone.  paper towels gone.  lysol, down to 4 bottles.  hand sanitizer and alcohol, gone.  bleach was available but only a few rem...
1                                                                                                                                                                                                                                                            amazon is running low on cat and dog food, my normals are sold out. i feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.
2       because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pande

### Remove URLs / website addresses

In [18]:
# Function for removing URLs
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [19]:
df['post'] = df['post'].map( remove_urls )
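
To see what this pattern does to Reddit's Markdown links, here is a minimal sketch that uses only `re`. The sample string, `MD_LINK_PATTERN`, and `remove_markdown_links_then_urls` are hypothetical additions for illustration and are not part of the original notebook.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
# Hypothetical extra pattern: a whole Markdown link of the form [label](target)
MD_LINK_PATTERN = re.compile(r'\[[^\]]*\]\([^)\s]*\)')

def remove_urls(text):
    # Same pattern as the notebook: delete from the scheme up to the next whitespace
    return URL_PATTERN.sub('', text)

def remove_markdown_links_then_urls(text):
    # Drop complete [label](target) links first, then any remaining bare URLs
    return remove_urls(MD_LINK_PATTERN.sub('', text))

sample = "see [https://example.com/a](https://example.com/a) now"
plain = remove_urls(sample)
linked = remove_markdown_links_then_urls(sample)
print(repr(plain))   # the opening bracket of the Markdown link is left behind
print(repr(linked))  # the whole link is gone
```

Because `\S+` runs until the next whitespace, the URL-only pattern swallows everything after the opening `[` of a Markdown link but leaves the bracket itself. That may explain stray `[` characters in the cleaned posts.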

### Removing special characters

In [20]:
# Raw strings keep regex escapes such as \S and \* from being read as string escapes
df['post'] = df['post']\
                        .replace(r'http\S+', '', regex=True)\
                        .replace(r'www\S+', '', regex=True)\
                        .replace(r'\n\n\S+', '', regex=True)\
                        .replace(r'\n', '', regex=True)\
                        .replace(r'\*', '', regex=True)

In [21]:
df['post']

0       anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't!  i took my under 10 aged child to walmart today and picked up some more to add, and it is a night and day difference from 3 days ago.  toilet paper gone.  paper towels gone.  lysol, down to 4 bottles.  hand sanitizer and alcohol, gone.  bleach was available but only a few rem...
1                                                                                                                                                                                                                                                            amazon is running low on cat and dog food, my normals are sold out. i feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.
2                            because you just can't find it in stores any more... [  5 c 91% isopropyl alcohol 2 c aloe vera gel another receipt with essential oils [   1 tablespoon ru
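
The effect of the newline and asterisk patterns is easier to see on a single string. The sketch below replays them with `re.sub`, assuming that `Series.replace(..., regex=True)` applies the same substitutions per element. The sample text and the `gentle` variant are made up for illustration.

```python
import re

# Made-up sample that mimics the structure of a Reddit post body
text = "any more...\n\nHere's another recipe\n* 2 c aloe vera gel"

step1 = re.sub(r'\n\n\S+', '', text)  # drops the blank line AND the word right after it
step2 = re.sub(r'\n', '', step1)      # deletes remaining newlines without adding a space
step3 = re.sub(r'\*', '', step2)      # deletes literal asterisks

print(repr(step1))
print(repr(step3))

# Hypothetical gentler variant: turn newlines and asterisks into spaces, then collapse
gentle = " ".join(re.sub(r'[\n*]', ' ', text).split())
print(repr(gentle))
```

Note that `\n\n\S+` removes the first word of every new paragraph ("Here's" in this sample), and that replacing `\n` with an empty string can glue neighbouring tokens together. Replacing with a space, as in `gentle`, avoids both side effects.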

### Find/Count emoji

In [22]:
import demoji

In [23]:
def find_emoji(series, print_option=False):
    # True for every post that contains at least one emoji
    has_emoji = series.map(lambda text: bool(demoji.findall(text)))
    if print_option:
        print(series[has_emoji])
    return has_emoji.sum()

In [24]:
find_emoji(df['post'])

54

### Remove emoji

In [25]:
def remove_emoji(dataframe):
    return dataframe.map(demoji.replace)

In [26]:
df['post'] = remove_emoji(df['post'])

### Convert emoji to text
All emojis are removed for the first part of the project, which is distinguishing between two subreddits.
However, emojis are converted to text for sentiment analysis.

In [27]:
import emoji
def convert_emoji_to_text(text):
    return emoji.demojize(text)

In [28]:
# df['selftext'].iloc[0:10].map(convert_emoji_to_text)
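
As a rough illustration of what converting an emoji to text means, the sketch below uses only the standard library's `unicodedata` module. It is a stand-in for the idea, not a reimplementation of `emoji.demojize`. The function name `describe_symbols` is hypothetical, and the approach only handles single-code-point symbols, not skin-tone or joined sequences.

```python
import unicodedata

def describe_symbols(text):
    # Replace every character in Unicode category "So" (Symbol, other),
    # which covers many emoji, with its Unicode name written as :snake_case:
    pieces = []
    for char in text:
        if unicodedata.category(char) == "So":
            name = unicodedata.name(char, "unknown symbol")
            pieces.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            pieces.append(char)
    return "".join(pieces)

# \U0001F637 is the "face with medical mask" emoji
described = describe_symbols("stay safe \U0001F637")
print(described)
```

Keeping the emoji as a word-like token lets a sentiment model treat it as a feature instead of losing it.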

### Removal of HTML tags

In [29]:
def remove_html(text):
    return BeautifulSoup(text, "lxml").text

In [30]:
df['post'] = df['post'].map(remove_html)
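
For readers without `bs4` or `lxml` installed, the same idea can be sketched with the standard library's `html.parser`. This is a simplified stand-in under that assumption, not the parser the notebook uses. `TextExtractor` and `strip_tags` are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

cleaned = strip_tags("<p>wash &amp; dry your <b>hands</b></p>")
print(cleaned)
```

Tags are dropped and character entities are decoded, so a fragment like `&amp;` comes back as a plain `&`.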

In [31]:
df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp,post
0,It's going fast in TN,"Anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! I took my under 10 aged child to Walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. Toilet paper gone. Paper towels gone. Lysol, down to 4 bottles. Hand sanitizer and alcohol, gone. Bleach was available but only a few rem...",PandemicPreps,1583021338,BeautifulBalance1,18,1,True,2020-02-29 16:08:58,"anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! i took my under 10 aged child to walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. toilet paper gone. paper towels gone. lysol, down to 4 bottles. hand sanitizer and alcohol, gone. bleach was available but only a few rem..."
1,Don’t forget about your pets!,"Amazon is running low on cat and dog food, my normals are sold out. I feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.",PandemicPreps,1583021470,JKMSDE,12,1,True,2020-02-29 16:11:10,"amazon is running low on cat and dog food, my normals are sold out. i feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter."
2,DIY Hand Sanitizer,Because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/) \n\n* 5 c 91% isopropyl alcohol\n* 2 c aloe vera gel\n\nHere's another receipt with essential oils\n\n [https://www.asiaone.com/lifestyle/make-your-own-diy-hand-sanitizer](https://www....,PandemicPreps,1583022495,AccidentalDragon,44,1,True,2020-02-29 16:28:15,because you just can't find it in stores any more... [ 5 c 91% isopropyl alcohol 2 c aloe vera gel another receipt with essential oils [ 1 tablespoon rubbing alcohol or 2 tablespoons vodka 10 drops tea tree essential oil 10 drops lavender essential oil 1/4 cup aloe vera gel 1/2 teaspoon vitamin e oil (optional) a small bottle if i could just find some iso in the stores lol
3,Anyone notice anything in the Denver / Boulder area? I see no concern and it concerns me!,What the hell is going on. Anyone around this area notice anything?,PandemicPreps,1583023089,Wizard_Knife_Fight,12,1,True,2020-02-29 16:38:09,what the hell is going on. anyone around this area notice anything?
4,"PSA: don’t only buy a bunch of random foods, know how to cook it and get the proper nutrition.","I see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). Just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. Have a good balance of foods and learn the nutrients balance. Vitamin supplements are very important because fresh produce can...",PandemicPreps,1583025573,cutting-alumination,56,1,True,2020-02-29 17:19:33,"i see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. have a good balance of foods and learn the nutrients balance. vitamin supplements are very important because fresh produce can..."


### Replace all non-letters with space

In [32]:
def replace_all_non_letters_with_space(text):
    return re.sub("[^a-zA-Z]",  # Search for all non-letters
                  " ",          # Replace all non-letters with spaces
                  str(text))

In [33]:
df['post'] = df['post'].map(replace_all_non_letters_with_space)
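
A quick check on one string shows the side effects of this substitution: digits disappear, contractions are split at the apostrophe, and every removed character leaves a space behind. The sample sentence and the `collapse_whitespace` helper are hypothetical additions for illustration.

```python
import re

def replace_all_non_letters_with_space(text):
    # Every character that is not an ASCII letter becomes a single space
    return re.sub("[^a-zA-Z]", " ", str(text))

def collapse_whitespace(text):
    # Hypothetical follow-up step: squeeze runs of whitespace into one space
    return " ".join(text.split())

raw = "you haven't, almost 300lbs!"
spaced = replace_all_non_letters_with_space(raw)
tidy = collapse_whitespace(spaced)
print(repr(spaced))  # same length as the input, with gaps where characters were removed
print(repr(tidy))
```

The leftover fragments such as `t` from `haven't` are short tokens that a later stop-word step may or may not remove, depending on the stop-word list.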

In [34]:
df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp,post
0,It's going fast in TN,"Anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! I took my under 10 aged child to Walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. Toilet paper gone. Paper towels gone. Lysol, down to 4 bottles. Hand sanitizer and alcohol, gone. Bleach was available but only a few rem...",PandemicPreps,1583021338,BeautifulBalance1,18,1,True,2020-02-29 16:08:58,anyone in this group should know to buy now if you haven t but for the sake of reinforcement buy now if you haven t i took my under aged child to walmart today and picked up some more to add and it is a night and day difference from days ago toilet paper gone paper towels gone lysol down to bottles hand sanitizer and alcohol gone bleach was available but only a few rem...
1,Don’t forget about your pets!,"Amazon is running low on cat and dog food, my normals are sold out. I feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.",PandemicPreps,1583021470,JKMSDE,12,1,True,2020-02-29 16:11:10,amazon is running low on cat and dog food my normals are sold out i feel bad for the delivery guy dropping my shit off almost lbs of food and liter
2,DIY Hand Sanitizer,Because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/) \n\n* 5 c 91% isopropyl alcohol\n* 2 c aloe vera gel\n\nHere's another receipt with essential oils\n\n [https://www.asiaone.com/lifestyle/make-your-own-diy-hand-sanitizer](https://www....,PandemicPreps,1583022495,AccidentalDragon,44,1,True,2020-02-29 16:28:15,because you just can t find it in stores any more c isopropyl alcohol c aloe vera gel another receipt with essential oils tablespoon rubbing alcohol or tablespoons vodka drops tea tree essential oil drops lavender essential oil cup aloe vera gel teaspoon vitamin e oil optional a small bottle if i could just find some iso in the stores lol
3,Anyone notice anything in the Denver / Boulder area? I see no concern and it concerns me!,What the hell is going on. Anyone around this area notice anything?,PandemicPreps,1583023089,Wizard_Knife_Fight,12,1,True,2020-02-29 16:38:09,what the hell is going on anyone around this area notice anything
4,"PSA: don’t only buy a bunch of random foods, know how to cook it and get the proper nutrition.","I see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). Just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. Have a good balance of foods and learn the nutrients balance. Vitamin supplements are very important because fresh produce can...",PandemicPreps,1583025573,cutting-alumination,56,1,True,2020-02-29 17:19:33,i see a lot of people on here buying around the same types of stuff canned beans rice etc just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example you can t live off of only eating meat products have a good balance of foods and learn the nutrients balance vitamin supplements are very important because fresh produce can...


### Remove Stop Words

In [35]:
# Importing stopwords from nltk library
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

def remove_stop_words(tokens):
    # Filter an already-tokenized post against the stop-word set
    return [token for token in tokens if token not in STOPWORDS]

In [36]:
# Function to remove the stopwords from a whitespace-separated string.
# Named remove_stopwords so it does not shadow the imported nltk `stopwords` module.
def remove_stopwords(text):
    return " ".join(remove_stop_words(str(text).split()))

# Apply the stop-word removal to the 'post' column
df["post"] = df["post"].apply(remove_stopwords)
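
The filtering step itself needs nothing beyond a set lookup. The sketch below uses a tiny hand-written stop-word set as a stand-in for NLTK's English list, so it runs without downloading the corpus. `SAMPLE_STOPWORDS` and the sample sentence are assumptions for illustration only.

```python
# Tiny illustrative stop-word set; NLTK's English list is much longer
SAMPLE_STOPWORDS = {"i", "my", "the", "is", "on", "and", "of", "for", "are", "out"}

def remove_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    # Split on whitespace, keep the words that are not stop words, and re-join
    return " ".join(word for word in str(text).split() if word not in stop_words)

filtered = remove_stopwords("amazon is running low on cat and dog food")
print(filtered)
```

Storing the stop words in a `set` keeps each membership test cheap, which matters when the function is applied to thousands of posts.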

In [37]:
df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp,post
0,It's going fast in TN,"Anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! I took my under 10 aged child to Walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. Toilet paper gone. Paper towels gone. Lysol, down to 4 bottles. Hand sanitizer and alcohol, gone. Bleach was available but only a few rem...",PandemicPreps,1583021338,BeautifulBalance1,18,1,True,2020-02-29 16:08:58,anyone group know buy sake reinforcement buy took aged child walmart today picked add night day difference days ago toilet paper gone paper towels gone lysol bottles hand sanitizer alcohol gone bleach available remained cold meds low even dish gloves ransacked hot dogs gone flour low go beans rice aisle child said coronavirus yes lady checking front joke bottles shampoo conditioner spouse thin...
1,Don’t forget about your pets!,"Amazon is running low on cat and dog food, my normals are sold out. I feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.",PandemicPreps,1583021470,JKMSDE,12,1,True,2020-02-29 16:11:10,amazon running low cat dog food normals sold feel bad delivery guy dropping shit almost lbs food liter
2,DIY Hand Sanitizer,Because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/) \n\n* 5 c 91% isopropyl alcohol\n* 2 c aloe vera gel\n\nHere's another receipt with essential oils\n\n [https://www.asiaone.com/lifestyle/make-your-own-diy-hand-sanitizer](https://www....,PandemicPreps,1583022495,AccidentalDragon,44,1,True,2020-02-29 16:28:15,find stores c isopropyl alcohol c aloe vera gel another receipt essential oils tablespoon rubbing alcohol tablespoons vodka drops tea tree essential oil drops lavender essential oil cup aloe vera gel teaspoon vitamin e oil optional small bottle could find iso stores lol
3,Anyone notice anything in the Denver / Boulder area? I see no concern and it concerns me!,What the hell is going on. Anyone around this area notice anything?,PandemicPreps,1583023089,Wizard_Knife_Fight,12,1,True,2020-02-29 16:38:09,hell going anyone around area notice anything
4,"PSA: don’t only buy a bunch of random foods, know how to cook it and get the proper nutrition.","I see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). Just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. Have a good balance of foods and learn the nutrients balance. Vitamin supplements are very important because fresh produce can...",PandemicPreps,1583025573,cutting-alumination,56,1,True,2020-02-29 17:19:33,see lot people buying around types stuff canned beans rice etc remember know cook item buy combination foods right nutrients example live eating meat products good balance foods learn nutrients balance vitamin supplements important fresh produce go bad stored remember buy purpose good luck prepping guys


### Spelling Correction

In [38]:
def compare(corrected_text, original_text):
    # Compare the first corrected post with the first original post, word by word
    corrected_words = list(corrected_text)[0].split(' ')
    original_words = list(original_text)[0].split(' ')
    good = 0
    bad = 0
    for corrected, original in zip(corrected_words, original_words):
        if corrected != original:
            bad += 1
            print(corrected, original)
        else:
            good += 1
    total = good + bad
    print(f'Number of accurate words = {good}'
          f'\nNumber of corrected words = {bad}'
          f'\nCorrection percentage = {np.round(bad*100/total, 1)}%')


In [39]:
from textblob import TextBlob

def correct_spell(original_text_df):
    # Run TextBlob's spelling correction on each post
    return original_text_df.apply(lambda x: str(TextBlob(x).correct()))
    

In [40]:
# df['post'] = correct_spell(original_text_df=df['post'])

In [41]:
df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp,post
0,It's going fast in TN,"Anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! I took my under 10 aged child to Walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. Toilet paper gone. Paper towels gone. Lysol, down to 4 bottles. Hand sanitizer and alcohol, gone. Bleach was available but only a few rem...",PandemicPreps,1583021338,BeautifulBalance1,18,1,True,2020-02-29 16:08:58,anyone group know buy sake reinforcement buy took aged child walmart today picked add night day difference days ago toilet paper gone paper towels gone lysol bottles hand sanitizer alcohol gone bleach available remained cold meds low even dish gloves ransacked hot dogs gone flour low go beans rice aisle child said coronavirus yes lady checking front joke bottles shampoo conditioner spouse thin...
1,Don’t forget about your pets!,"Amazon is running low on cat and dog food, my normals are sold out. I feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.",PandemicPreps,1583021470,JKMSDE,12,1,True,2020-02-29 16:11:10,amazon running low cat dog food normals sold feel bad delivery guy dropping shit almost lbs food liter
2,DIY Hand Sanitizer,Because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/) \n\n* 5 c 91% isopropyl alcohol\n* 2 c aloe vera gel\n\nHere's another receipt with essential oils\n\n [https://www.asiaone.com/lifestyle/make-your-own-diy-hand-sanitizer](https://www....,PandemicPreps,1583022495,AccidentalDragon,44,1,True,2020-02-29 16:28:15,find stores c isopropyl alcohol c aloe vera gel another receipt essential oils tablespoon rubbing alcohol tablespoons vodka drops tea tree essential oil drops lavender essential oil cup aloe vera gel teaspoon vitamin e oil optional small bottle could find iso stores lol
3,Anyone notice anything in the Denver / Boulder area? I see no concern and it concerns me!,What the hell is going on. Anyone around this area notice anything?,PandemicPreps,1583023089,Wizard_Knife_Fight,12,1,True,2020-02-29 16:38:09,hell going anyone around area notice anything
4,"PSA: don’t only buy a bunch of random foods, know how to cook it and get the proper nutrition.","I see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). Just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. Have a good balance of foods and learn the nutrients balance. Vitamin supplements are very important because fresh produce can...",PandemicPreps,1583025573,cutting-alumination,56,1,True,2020-02-29 17:19:33,see lot people buying around types stuff canned beans rice etc remember know cook item buy combination foods right nutrients example live eating meat products good balance foods learn nutrients balance vitamin supplements important fresh produce go bad stored remember buy purpose good luck prepping guys


### Stemming
When we "stem" data, we strip each word down to an approximate base form. Stemming tends to be cruder than lemmatization: the resulting stem is often not a dictionary word.
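
To make the idea concrete without any dependencies, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies a much richer, ordered rule set. The `SUFFIXES` tuple, `naive_stem`, and `stem_text` are hypothetical names used only for this illustration.

```python
# Illustrative suffix list only; a real stemmer such as Porter uses many more rules
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    # Strip the first matching suffix, as long as at least 3 characters remain
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_text(text):
    return " ".join(naive_stem(word) for word in text.split())

stems = stem_text("dropping bottles stored buying")
print(stems)
```

Stems such as `dropp` or `bottl` are not real words, but variants like "bottle" and "bottles" are more likely to collapse onto the same token, which is usually what a bag-of-words model needs.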

In [42]:
Pstemmizer = PorterStemmer()

In [43]:
def make_token(post):
    tokenizer = RegexpTokenizer(r'\w+') # remove the punctuation 
    post_tokens = tokenizer.tokenize(post)
    stem_spam = [Pstemmizer.stem(token) for token in post_tokens]
    return (' '.join(stem_spam))
    

In [44]:
df['token'] = list(map(make_token,df['post']))

In [45]:
df[['selftext','post','token']]

Unnamed: 0,selftext,post,token
0,"Anyone in this group should know to buy now if you haven't, but for the sake of reinforcement - buy now if you haven't! I took my under 10 aged child to Walmart today and picked up some more to add, and it is a night and day difference from 3 days ago. Toilet paper gone. Paper towels gone. Lysol, down to 4 bottles. Hand sanitizer and alcohol, gone. Bleach was available but only a few rem...",anyone group know buy sake reinforcement buy took aged child walmart today picked add night day difference days ago toilet paper gone paper towels gone lysol bottles hand sanitizer alcohol gone bleach available remained cold meds low even dish gloves ransacked hot dogs gone flour low go beans rice aisle child said coronavirus yes lady checking front joke bottles shampoo conditioner spouse thin...,anyon group know buy sake reinforc buy took age child walmart today pick add night day differ day ago toilet paper gone paper towel gone lysol bottl hand sanit alcohol gone bleach avail remain cold med low even dish glove ransack hot dog gone flour low go bean rice aisl child said coronaviru ye ladi check front joke bottl shampoo condition spous think overplay taken video could see saw pleas g...
1,"Amazon is running low on cat and dog food, my normals are sold out. I feel bad for the delivery guy dropping my shit off, almost 300lbs of food and liter.",amazon running low cat dog food normals sold feel bad delivery guy dropping shit almost lbs food liter,amazon run low cat dog food normal sold feel bad deliveri guy drop shit almost lb food liter
2,Because you just can't find it in stores any more...\n\n [http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/](http://www.utahpreppers.com/2009/04/pandemic-preparedness-diy-sanitization/) \n\n* 5 c 91% isopropyl alcohol\n* 2 c aloe vera gel\n\nHere's another receipt with essential oils\n\n [https://www.asiaone.com/lifestyle/make-your-own-diy-hand-sanitizer](https://www....,find stores c isopropyl alcohol c aloe vera gel another receipt essential oils tablespoon rubbing alcohol tablespoons vodka drops tea tree essential oil drops lavender essential oil cup aloe vera gel teaspoon vitamin e oil optional small bottle could find iso stores lol,find store c isopropyl alcohol c alo vera gel anoth receipt essenti oil tablespoon rub alcohol tablespoon vodka drop tea tree essenti oil drop lavend essenti oil cup alo vera gel teaspoon vitamin e oil option small bottl could find iso store lol
3,What the hell is going on. Anyone around this area notice anything?,hell going anyone around area notice anything,hell go anyon around area notic anyth
4,"I see a lot of people on here buying around the same types of stuff (canned beans, rice, etc). Just remember to know how to cook each item you buy and which combination of foods will have the right nutrients for example: you can’t live off of only eating meat products. Have a good balance of foods and learn the nutrients balance. Vitamin supplements are very important because fresh produce can...",see lot people buying around types stuff canned beans rice etc remember know cook item buy combination foods right nutrients example live eating meat products good balance foods learn nutrients balance vitamin supplements important fresh produce go bad stored remember buy purpose good luck prepping guys,see lot peopl buy around type stuff can bean rice etc rememb know cook item buy combin food right nutrient exampl live eat meat product good balanc food learn nutrient balanc vitamin supplement import fresh produc go bad store rememb buy purpos good luck prep guy
...,...,...,...
2357,So right now I have about 50 days of food prepped for 2 people (50 each) at 2000 calories. \n\nI’m not sure if I should get more. I’ve got all my other supplies with just a few things here and there but nothing critical I need anymore.\n\nI’m just not sure if 50 days per person for me and my partner is enough or if I should get more. \n\nHow many days of food are you prepping for each person?,right days food prepped people calories sure get got supplies things nothing critical need anymore sure days per person partner enough get many days food prepping person,right day food prep peopl calori sure get got suppli thing noth critic need anymor sure day per person partner enough get mani day food prep person
2358,"\nEven if you avoid non essential travel, I'm sure some people might still need to fly long distances for various obligations.\n\nAre there guides or tips on what exactly one should do while going through security check, same sitting in the plane?\n\nFor instance, how to sanitize your plane seat?\n\nAny advice would be appreciated!!",even avoid non essential travel sure people might still need fly long distances various obligations guides tips exactly one going security check sitting plane instance sanitize plane seat advice would appreciated,even avoid non essenti travel sure peopl might still need fli long distanc variou oblig guid tip exactli one go secur check sit plane instanc sanit plane seat advic would appreci
2359,Why are many people expecting power outages and/or water contamination/loss of water during COVID-19 pandemic? \n\nIt’s an honest question and I’d like to understand people’s reasoning as maybe I’m overlooking something important. I keep hearing people say they are ready for the coming power outages or seeing it online.\n\nI’ve been following news and individual reports from hard hit areas and...,many people expecting power outages water contamination loss water covid pandemic honest question like understand people reasoning maybe overlooking something important keep hearing people say ready coming power outages seeing online following news individual reports hard hit areas quarantined areas appear shut power plants water treatment plants appear disruptions sars even ebola hit areas es...,mani peopl expect power outag water contamin loss water covid pandem honest question like understand peopl reason mayb overlook someth import keep hear peopl say readi come power outag see onlin follow news individu report hard hit area quarantin area appear shut power plant water treatment plant appear disrupt sar even ebola hit area essenti personnel critic servic alway work genuin wonder co...
2360,"My husband’s been an ER RN for 20 years. He’s also been a Type 1 diabetic since he was 6. Over the course of our 26-year marriage the subject of a pandemic has come up several times. We created a master list of our needs and over the years we’ve purchased the big items. Last month we refreshed our supply on old items, but we’re still banging our heads against the same and most important subjec...",husband er rn years also type diabetic since course year marriage subject pandemic come several times created master list needs years purchased big items last month refreshed supply old items still banging heads important subject solar mini fridge insulin lose power feel dumb overwhelmed try understand need house root cellar giant generator apartment dwellers second floor yard ability mount so...,husband er rn year also type diabet sinc cours year marriag subject pandem come sever time creat master list need year purchas big item last month refresh suppli old item still bang head import subject solar mini fridg insulin lose power feel dumb overwhelm tri understand need hous root cellar giant gener apart dweller second floor yard abil mount solar panel hope get advic someon tackl issu s...


## Save text-processed doc

In [46]:
# check nulls
df.isnull().sum()

title           0
selftext        0
subreddit       0
created_utc     0
author          0
num_comments    0
score           0
is_self         0
timestamp       0
post            0
token           0
dtype: int64

In [47]:
# save processed data to be used for distinguishing with P
df.to_csv('../datasets/text_processed_PandemicPreps.csv')
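One caveat when saving processed text: a post whose words were all stripped during cleaning ends up as an empty string. It passes the `isnull()` check above, but `pd.read_csv` reads it back as `NaN`. The following is a minimal sketch on a toy frame, not the actual dataset; `demo`, `buffer`, and `reloaded` are hypothetical names and the rows are made up for illustration.

```python
import io
import pandas as pd

# Toy frame (hypothetical data): the second row lost every token during cleaning
demo = pd.DataFrame({
    "post": ["Toilet paper gone", "???"],
    "token": ["toilet paper gone", ""],
})

# An empty string is not counted as null before saving
nulls_before = int(demo["token"].isnull().sum())

# Round-trip through CSV in memory instead of writing to disk
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# After reloading, the empty field is parsed as NaN
nulls_after = int(reloaded["token"].isnull().sum())
print(nulls_before, nulls_after)
```

If this matters downstream, one option is to repeat the null check after reloading the file, or to pass `keep_default_na=False` to `pd.read_csv` so that empty fields stay as empty strings.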

## Restrict the time range: March 2020 to March 2021

In [48]:
# preview the timestamps before the cutoff (the result is not assigned, so df is unchanged)
df.loc[df['timestamp'] < '2021-03-09 16:20:54', 'timestamp']

0       2020-02-29 16:08:58
1       2020-02-29 16:11:10
2       2020-02-29 16:28:15
3       2020-02-29 16:38:09
4       2020-02-29 17:19:33
               ...         
2357    2020-02-29 14:58:17
2358    2020-02-29 15:46:42
2359    2020-02-29 15:55:08
2360    2020-02-29 15:56:39
2361    2020-02-29 15:57:46
Name: timestamp, Length: 2260, dtype: object

In [49]:
df.shape

(2362, 11)
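Note that the expression in In [48] only displays the filtered timestamps (2260 rows). `df` itself still has 2362 rows, as the shape above shows. If the intent is to keep only posts before the cutoff, the filtered result would need to be assigned. Below is a minimal sketch on a toy frame, assuming the same cutoff as In [48]; `toy`, `cutoff`, and `filtered` are hypothetical names and the rows are made up.

```python
import pandas as pd

# Toy frame (hypothetical rows): one before, one at, and one after the cutoff
toy = pd.DataFrame({
    "timestamp": [
        "2020-02-29 16:08:58",
        "2021-03-09 16:20:54",
        "2021-06-01 09:00:00",
    ]
})

# Parse strings to datetimes so the comparison is chronological,
# not a character-by-character string comparison
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

cutoff = pd.Timestamp("2021-03-09 16:20:54")

# Assign the result; without the assignment the original frame keeps all rows
filtered = toy.loc[toy["timestamp"] < cutoff].copy()

print(toy.shape, filtered.shape)
```

The strict `<` drops a post stamped exactly at the cutoff; use `<=` if that post should be kept.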

## Save text-processed doc (March 2020 to March 2021)

Check for nulls one last time before saving.

In [50]:
df.isnull().sum()

title           0
selftext        0
subreddit       0
created_utc     0
author          0
num_comments    0
score           0
is_self         0
timestamp       0
post            0
token           0
dtype: int64

In [51]:
df.to_csv('../datasets/text_processed_PandemicPreps_Mar2020_Mar2021.csv')