# Cleaning data
In this section of the project, the data is loaded from the dataset folder and cleaned.
First, the required libraries are imported.

In [1]:
import pandas as pd
import regex as re
import warnings
warnings.filterwarnings('ignore')
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import pickle

If any library is missing, install it with ```pip install library_name```

In [2]:
# !pip install nltk
# !pip install regex

## NASA data

In [3]:
file_path = "../DataSet/"
file_name = "df_nasa.csv"
df_nasa = pd.read_csv(file_path+file_name)

In [4]:
df_nasa.shape

(6000, 83)

Check the column names and details as follows.

In [5]:
df_nasa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 83 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     6000 non-null   int64  
 1   index                          6000 non-null   int64  
 2   all_awardings                  5492 non-null   object 
 3   allow_live_comments            4249 non-null   object 
 4   author                         6000 non-null   object 
 5   author_cakeday                 19 non-null     object 
 6   author_flair_background_color  14 non-null     object 
 7   author_flair_css_class         58 non-null     object 
 8   author_flair_richtext          6000 non-null   object 
 9   author_flair_template_id       30 non-null     object 
 10  author_flair_text              58 non-null     object 
 11  author_flair_text_color        58 non-null     object 
 12  author_flair_type              6000 non-null   o

Keep only the following columns.

In [6]:
keep_clmns = ['author', 'created_utc', 'domain', 'id', 'num_comments', 'over_18',
       'post_hint', 'score', 'selftext',
       'title']

In [7]:
# Take an explicit copy so later edits do not act on a view of df_nasa
df_nasa_keep_colmn = df_nasa[keep_clmns].copy()

In [8]:
df_nasa_keep_colmn.head(5)

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,selftext,title
0,illichian,1579413305,i.imgur.com,eqsltj,2,False,link,1,,A star shining through Saturn's rings
1,itstie,1579412680,i.redd.it,eqsibf,0,False,,1,,From Smithsonian National Air and Space Museum
2,NASA_POTD_bot,1579410507,apod.nasa.gov,eqs6cb,0,False,,1,,M1: The Incredible Expanding Crab Nebula
3,AMC-Eagle85,1579410277,i.redd.it,eqs4zd,6,False,,1,,Columbia ready for STS-107
4,BorisTheSpacePerson,1579404939,i.redd.it,eqr7wu,0,False,,1,,I went to ksc for Christmas and got to see wha...


In [9]:
df_nasa_keep_colmn.isnull().sum()

author             0
created_utc        0
domain             0
id                 0
num_comments       0
over_18            0
post_hint       3708
score              0
selftext        5029
title              0
dtype: int64

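I follow the same approach as Meghani for imputing, dropping, and editing columns: missing titles and post bodies are filled with a space, the two are merged into a single text column, and missing `post_hint` values are marked as "Empty".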

In [10]:
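# Replace missing titles and post bodies with a single space
df_nasa_keep_colmn["title"] = df_nasa_keep_colmn["title"].fillna(" ")
df_nasa_keep_colmn["selftext"] = df_nasa_keep_colmn["selftext"].fillna(" ")

# Merge title and body into one text column, then drop the originals
df_nasa_keep_colmn['text_merged'] = df_nasa_keep_colmn['title'] + " " + df_nasa_keep_colmn['selftext']
df_nasa_keep_colmn = df_nasa_keep_colmn.drop(columns=["title", "selftext"])

# Mark posts that have no post_hint explicitly
df_nasa_keep_colmn['post_hint'] = df_nasa_keep_colmn['post_hint'].fillna("Empty")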
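
The same imputation and merge steps are applied to both data sets, so they could be wrapped in one reusable function. The following is only a minimal sketch under that assumption: the name `merge_text_columns` and the toy rows are hypothetical and do not come from the project.

```python
import pandas as pd


def merge_text_columns(df):
    """Return a copy with title/selftext merged and post_hint imputed."""
    out = df.copy()  # work on a copy so the input frame is left untouched
    out["title"] = out["title"].fillna(" ")
    out["selftext"] = out["selftext"].fillna(" ")
    # Merge title and body, then drop the two source columns
    out["text_merged"] = out["title"] + " " + out["selftext"]
    out = out.drop(columns=["title", "selftext"])
    out["post_hint"] = out["post_hint"].fillna("Empty")
    return out


# Toy frame with made-up rows, used only to exercise the function
toy = pd.DataFrame({
    "title": ["This is Saturn", "Online space-related degree"],
    "selftext": [None, "Any advice?"],
    "post_hint": ["link", None],
})
cleaned = merge_text_columns(toy)
print(cleaned)
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison and avoids modifying a slice of another DataFrame.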

Double-check the columns for null values.

In [11]:
df_nasa_keep_colmn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   author        6000 non-null   object
 1   created_utc   6000 non-null   int64 
 2   domain        6000 non-null   object
 3   id            6000 non-null   object
 4   num_comments  6000 non-null   int64 
 5   over_18       6000 non-null   bool  
 6   post_hint     6000 non-null   object
 7   score         6000 non-null   int64 
 8   text_merged   6000 non-null   object
dtypes: bool(1), int64(3), object(5)
memory usage: 381.0+ KB


In [12]:
df_nasa_keep_colmn.head()

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,text_merged
0,illichian,1579413305,i.imgur.com,eqsltj,2,False,link,1,A star shining through Saturn's rings
1,itstie,1579412680,i.redd.it,eqsibf,0,False,Empty,1,From Smithsonian National Air and Space Museum
2,NASA_POTD_bot,1579410507,apod.nasa.gov,eqs6cb,0,False,Empty,1,M1: The Incredible Expanding Crab Nebula
3,AMC-Eagle85,1579410277,i.redd.it,eqs4zd,6,False,Empty,1,Columbia ready for STS-107
4,BorisTheSpacePerson,1579404939,i.redd.it,eqr7wu,0,False,Empty,1,I went to ksc for Christmas and got to see wha...


In [13]:
print(df_nasa_keep_colmn['text_merged'][0])
print(df_nasa_keep_colmn['text_merged'][5999])

A star shining through Saturn's rings  
This is Saturn  


## Space discussion data

In [14]:
file_path = "../DataSet/"
file_name = "df_space.csv"
df_space = pd.read_csv(file_path+file_name)

In [15]:
df_space.shape

(6000, 78)

In [16]:
df_space.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 78 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     6000 non-null   int64  
 1   index                          6000 non-null   int64  
 2   all_awardings                  6000 non-null   object 
 3   allow_live_comments            6000 non-null   bool   
 4   author                         6000 non-null   object 
 5   author_cakeday                 22 non-null     object 
 6   author_flair_background_color  0 non-null      float64
 7   author_flair_css_class         0 non-null      float64
 8   author_flair_richtext          6000 non-null   object 
 9   author_flair_text              14 non-null     object 
 10  author_flair_text_color        14 non-null     object 
 11  author_flair_type              6000 non-null   object 
 12  author_fullname                6000 non-null   o

In [17]:
# Take an explicit copy so later edits do not act on a view of df_space
df_space_keep_colmn = df_space[keep_clmns].copy()

In [18]:
df_space_keep_colmn.head(5)

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,selftext,title
0,Orchestratorgroup,1579412906,baysidesoap.com.au,eqsjlh,0,False,,1,,Basic Goats Milk Soap Base - Low Sweat
1,Official_CIA_Account,1579412748,self.space,eqsint,0,False,,1,Sue: It looks like the NASA guys had to resche...,"Veep quote, S02E07"
2,toddvii,1579412727,self.space,eqsikq,4,False,,1,"I’m gonna be real with you, I am a 21 year old...",Online space-related degree
3,ParkingCup7,1579411077,self.space,eqs9jw,1,False,,1,&amp;#x200B;\r\n\r\nhttps://preview.redd.it/cn...,Took this with a camcorder since I don't have ...
4,nmdzwptsw,1579410811,universetoday.com,eqs831,0,False,,1,,A Mysterious Burst of Gravitational Waves Came...


In [19]:
df_space_keep_colmn.isnull().sum()

author             0
created_utc        0
domain             0
id                 0
num_comments       0
over_18            0
post_hint       3733
score              0
selftext        4824
title              0
dtype: int64

In [20]:
# Replace missing titles and post bodies with a single space
df_space_keep_colmn["title"] = df_space_keep_colmn["title"].fillna(" ")
df_space_keep_colmn["selftext"] = df_space_keep_colmn["selftext"].fillna(" ")

In [21]:
# Merge title and body into one text column, then drop the originals
df_space_keep_colmn['text_merged'] = df_space_keep_colmn['title'] + " " + df_space_keep_colmn['selftext']
df_space_keep_colmn = df_space_keep_colmn.drop(columns=["title", "selftext"])

In [22]:
# Mark posts that have no post_hint explicitly
df_space_keep_colmn['post_hint'] = df_space_keep_colmn['post_hint'].fillna("Empty")

In [23]:
df_space_keep_colmn.head()

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,text_merged
0,Orchestratorgroup,1579412906,baysidesoap.com.au,eqsjlh,0,False,Empty,1,Basic Goats Milk Soap Base - Low Sweat
1,Official_CIA_Account,1579412748,self.space,eqsint,0,False,Empty,1,"Veep quote, S02E07 Sue: It looks like the NASA..."
2,toddvii,1579412727,self.space,eqsikq,4,False,Empty,1,Online space-related degree I’m gonna be real ...
3,ParkingCup7,1579411077,self.space,eqs9jw,1,False,Empty,1,Took this with a camcorder since I don't have ...
4,nmdzwptsw,1579410811,universetoday.com,eqs831,0,False,Empty,1,A Mysterious Burst of Gravitational Waves Came...


In [24]:
df_space_keep_colmn.isnull().sum()

author          0
created_utc     0
domain          0
id              0
num_comments    0
over_18         0
post_hint       0
score           0
text_merged     0
dtype: int64

In [25]:
print(df_space_keep_colmn['text_merged'][0])
print(df_space_keep_colmn['text_merged'][5999])

Basic Goats Milk Soap Base - Low Sweat  
The Eastern Veil Nebula  


Add a column that records the source of each post (NASA or Space discussion), then merge the two data sets into one.

In [26]:
# Add a column recording which subreddit each post was pulled from
df_nasa_keep_colmn["subreddit"] = "NASA"
df_space_keep_colmn["subreddit"] = "Space_discussion"
df_reddit = pd.concat([df_nasa_keep_colmn, df_space_keep_colmn], axis = 0, ignore_index=True)
df_reddit.head(5)

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,text_merged,subreddit
0,illichian,1579413305,i.imgur.com,eqsltj,2,False,link,1,A star shining through Saturn's rings,NASA
1,itstie,1579412680,i.redd.it,eqsibf,0,False,Empty,1,From Smithsonian National Air and Space Museum,NASA
2,NASA_POTD_bot,1579410507,apod.nasa.gov,eqs6cb,0,False,Empty,1,M1: The Incredible Expanding Crab Nebula,NASA
3,AMC-Eagle85,1579410277,i.redd.it,eqs4zd,6,False,Empty,1,Columbia ready for STS-107,NASA
4,BorisTheSpacePerson,1579404939,i.redd.it,eqr7wu,0,False,Empty,1,I went to ksc for Christmas and got to see wha...,NASA


In [27]:
df_reddit.shape

(12000, 10)
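
As a small illustration of the tag-and-merge step, the sketch below uses two made-up frames (the `nasa` and `space` variables and their rows are hypothetical). It uses `DataFrame.assign`, which returns new frames rather than modifying the inputs; this is an alternative to the in-place column assignment used in the notebook, not the notebook's own code.

```python
import pandas as pd

# Toy stand-ins for the two cleaned data sets (made-up rows)
nasa = pd.DataFrame({"id": ["a1", "a2"], "text_merged": ["x", "y"]})
space = pd.DataFrame({"id": ["b1"], "text_merged": ["z"]})

# Tag each frame with its source, then stack them row-wise.
# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
combined = pd.concat(
    [nasa.assign(subreddit="NASA"), space.assign(subreddit="Space_discussion")],
    axis=0,
    ignore_index=True,
)
print(combined.shape)
```

Without `ignore_index=True`, both halves would keep their own row labels starting at 0, so label-based lookups such as `combined["text_merged"][0]` would match more than one row.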

The `text_merged` column is cleaned by a function that applies the following steps:

* **Remove "\n" characters**
* **Remove "[removed]" markers**
* **Use regular expressions for find-and-replace**
* **Convert all characters to lower case**
* **Collapse multiple spaces**
* **Remove stopwords**
* **Instantiate a `PorterStemmer` object and stem each word**
* **Join the words back together with spaces**
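
The source of the `text_cleaning` module is not shown here, so the following is only a rough sketch of how the steps listed above might be implemented for a single string. The names `clean_text` and `DEFAULT_STOP_WORDS`, the letters-only regex pattern, and the tiny stopword set are all assumptions made for illustration. The stemmer is passed in as a parameter so the sketch runs without NLTK.

```python
import re

# Hypothetical stand-in for a full English stopword list, kept tiny so the
# sketch runs without downloading any corpus.
DEFAULT_STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "this", "is", "to", "of", "through", "from", "s"}
)


def clean_text(text, stop_words=DEFAULT_STOP_WORDS, stem=lambda word: word):
    """Apply the cleaning steps to one string and return the cleaned string."""
    text = text.replace("\n", " ")          # remove newline characters
    text = text.replace("[removed]", " ")   # remove "[removed]" markers
    text = re.sub(r"[^a-zA-Z]", " ", text)  # assumed pattern: keep letters only
    text = text.lower()
    words = text.split()                    # split() also collapses repeated spaces
    words = [stem(w) for w in words if w not in stop_words]
    return " ".join(words)                  # stitch the words back together


print(clean_text("A star shining through\nSaturn's rings [removed]"))
```

In the notebook's setting, one would presumably pass `PorterStemmer().stem` as `stem` and `set(stopwords.words('english'))` as `stop_words`, and apply the function with `df_reddit['text_merged'].apply(clean_text)`.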


In [28]:
# Add the ../Functions directory to the module search path
import sys; sys.path.insert(1, '../Functions')
import text_cleaning 
text_cleaning.Apply(df_reddit)

Unnamed: 0,author,created_utc,domain,id,num_comments,over_18,post_hint,score,text_merged,subreddit
0,illichian,1579413305,i.imgur.com,eqsltj,2,False,link,1,star shine saturn ring,NASA
1,itstie,1579412680,i.redd.it,eqsibf,0,False,Empty,1,smithsonian nation air space museum,NASA
2,NASA_POTD_bot,1579410507,apod.nasa.gov,eqs6cb,0,False,Empty,1,incred expand crab nebula,NASA
3,AMC-Eagle85,1579410277,i.redd.it,eqs4zd,6,False,Empty,1,columbia readi st,NASA
4,BorisTheSpacePerson,1579404939,i.redd.it,eqr7wu,0,False,Empty,1,went ksc christma got see made interest spacef...,NASA
...,...,...,...,...,...,...,...,...,...,...
11995,MistWeaver80,1573344101,i.redd.it,du3zh9,1,False,image,1,strike imag jupit captur nasa juno spacecraft ...,Space_discussion
11996,Idontlikecock,1573344003,i.redd.it,du3yub,1768,False,Empty,1,use hour exposur night sky reveal hundr galaxi...,Space_discussion
11997,Anchor-shark,1573342956,spaceflightnow.com,du3r89,8,False,Empty,1,boe identifi starlin parachut malfunct caus hu...,Space_discussion
11998,rosebudlodestar,1573339672,vimeo.com,du32pq,0,False,Empty,1,carl sagan read complet chapter book pale blue...,Space_discussion


In [29]:
df_reddit.shape

(12000, 10)

In [30]:
df_reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   author        12000 non-null  object
 1   created_utc   12000 non-null  int64 
 2   domain        12000 non-null  object
 3   id            12000 non-null  object
 4   num_comments  12000 non-null  int64 
 5   over_18       12000 non-null  bool  
 6   post_hint     12000 non-null  object
 7   score         12000 non-null  int64 
 8   text_merged   12000 non-null  object
 9   subreddit     12000 non-null  object
dtypes: bool(1), int64(3), object(6)
memory usage: 855.6+ KB


Now let's use the pickle library to save the data.

In [31]:
with open('../DataSet/df_reddit.pkl', 'wb') as f:
    pickle.dump(df_reddit, f)

In [32]:
with open('../DataSet/df_nasa_keep_colmn.pkl', 'wb') as f:
    pickle.dump(df_nasa_keep_colmn, f)
with open('../DataSet/df_space_keep_colmn.pkl', 'wb') as f:
    pickle.dump(df_space_keep_colmn, f)
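
To confirm that a pickled DataFrame can be read back unchanged, a round-trip check along the following lines may help. This is a sketch only: the toy frame and the file name `df_example.pkl` are hypothetical, and a temporary directory is used so nothing is written to the project's dataset folder.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data
frame = pd.DataFrame(
    {"text_merged": ["star shine saturn ring"], "subreddit": ["NASA"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_example.pkl")  # hypothetical file name
    # The with-blocks close each file handle as soon as the work is done
    with open(path, "wb") as handle:
        pickle.dump(frame, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

round_trip_ok = restored.equals(frame)
print(round_trip_ok)
```

Opening the file in a `with` block guarantees the handle is flushed and closed, which a bare `open(...)` passed directly to `pickle.dump` does not.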