# NLP Processing With SpaCy
---

## Contents
---
- [Data Retrieval](#Data-Retrieval)
- [SpaCy Processing](#SpaCy-Processing)

### Data Retrieval
___

**Library Imports**

In [44]:
import pandas as pd
import re
import spacy

**Read in cleaned_corpus.csv**

In [45]:
cleaned_corpus = pd.read_csv('./data/cleaned_corpus.csv')

In [46]:
cleaned_corpus.head()

Unnamed: 0,subreddit,text
0,1,The Traveling Journal of r/fountainpens (Appro...
1,1,"Sometimes, we just need this simple reminder 🤗"
2,1,I must confess I was wrong about Kaweco I've b...
3,1,Why are cheap fountain pens so much better? Th...
4,1,Literally classic design Beautiful stripes and...


### SpaCy Processing
---

In [48]:
# Load the medium size pipeline
nlp = spacy.load('en_core_web_md')

**Function that processes text with SpaCy so that it can be run through the `apply` method on the 'text' column of the 'cleaned_corpus' dataset.**

In [49]:
# Function created using the SpaCy lesson. References: https://spacy.io/api/token,
# https://realpython.com/natural-language-processing-spacy-python/#lemmatization,
# and ChatGPT for structure help. Hank reminded me that .apply will apply a function to a dataframe column.

def spacy_processor(text):

    # Run the text through the spaCy pipeline
    doc = nlp(text)

    # Keep lowercased lemmas of alphabetic, non-stop-word tokens, leaving out
    # one-letter tokens, which I saw during initial tests
    tokens = [token.lemma_.lower().strip() for token in doc if token.is_alpha and not token.is_stop and len(token.text) > 1]

    # Put the processed text back together
    processed_text = ' '.join(tokens)

    # Return the processed text to the dataframe
    return processed_text
    

In [50]:
# Apply the function to the text column of the cleaned_corpus dataframe
cleaned_corpus['processed_text'] = cleaned_corpus['text'].apply(spacy_processor)

In [51]:
#Checking out how it looks
cleaned_corpus.head()

Unnamed: 0,subreddit,text,processed_text
0,1,The Traveling Journal of r/fountainpens (Appro...,traveling journal fountainpens approve moderat...
1,1,"Sometimes, we just need this simple reminder 🤗",need simple reminder
2,1,I must confess I was wrong about Kaweco I've b...,confess wrong kaweco hobby decade especially l...
3,1,Why are cheap fountain pens so much better? Th...,cheap fountain pen well cheap pen zero problem...
4,1,Literally classic design Beautiful stripes and...,literally classic design beautiful stripe engr...


In [52]:
#Checking out the first entry in the processed_text column
cleaned_corpus['processed_text'][0]

'traveling journal fountainpens approve moderator greeting fountain pen family see original post propose pass journal new project read thrilled announce moderate team generously agree support assist ambitious undertaking great deal interested party room like project read outline instruction extremely excited move look forward amazing entry traveling journal fountainpens sub compose unique individual view experience talent creativity unite shared passion remarkable writing tool fountain pen capture uniqueness single tome sharing preservation plan straightfoward physical journal send member person add new entry typical journal diary type write entry lyric favorite meaningful song movie famous quote artwork feel like contribute periodically send archiving send member continue chain journal fill restriction add detail procedure take project fairly simple join sign submit info google form create mailing email address reddit username create free account require highly recommend save subscrib

In [53]:
# OK to drop the text column
cleaned_corpus.drop(columns = 'text', inplace = True)

In [55]:
cleaned_corpus.head()

Unnamed: 0,subreddit,processed_text
0,1,traveling journal fountainpens approve moderat...
1,1,need simple reminder
2,1,confess wrong kaweco hobby decade especially l...
3,1,cheap fountain pen well cheap pen zero problem...
4,1,literally classic design beautiful stripe engr...


While modeling, I noticed that reading the data back in produced four NaNs. Four of the rows in the dataset at this point contain only an empty string (''). An empty string is not a NaN in memory, but it holds no data, and pandas reads an empty CSV field back as NaN by default. The original text of these rows was either just emojis or just stop words, so nothing survived processing. I will drop these rows.
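
To illustrate why empty strings show up as NaN only after the data is saved and reloaded, here is a minimal sketch using a made-up two-row frame and an in-memory buffer instead of a file. The frame `toy` and its contents are assumptions for illustration, not rows from the corpus.

```python
import io

import pandas as pd

# Toy frame (made-up rows): the second processed_text is an empty string, not NaN.
toy = pd.DataFrame(
    {"subreddit": [1, 0], "processed_text": ["need simple reminder", ""]}
)
nan_count_before = int(toy["processed_text"].isna().sum())

# Write to an in-memory CSV and read it back with the pandas defaults.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
nan_count_default = int(reloaded["processed_text"].isna().sum())

# With keep_default_na=False, empty fields stay as empty strings.
buffer.seek(0)
reloaded_raw = pd.read_csv(buffer, keep_default_na=False)
nan_count_raw = int(reloaded_raw["processed_text"].isna().sum())

print(nan_count_before, nan_count_default, nan_count_raw)
```

Dropping the empty rows before saving, as done in this notebook, avoids having to handle the NaNs in the modeling step.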

In [58]:
cleaned_corpus[cleaned_corpus['processed_text'] == '']


Unnamed: 0,subreddit,processed_text
749,1,
1140,0,
2604,1,
2661,0,


In [60]:
cleaned_corpus.drop([749,1140,2604,2661], inplace = True)
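
The drop above hard-codes the four index labels found in the previous cell. If the corpus changes, those labels may no longer match, so a boolean mask is one possible alternative. This is a sketch on a made-up frame (`toy` and its index labels are hypothetical); the `.str.strip()` call is a defensive assumption in case a row holds only whitespace.

```python
import pandas as pd

# Toy frame (made-up rows) with one empty and one whitespace-only entry.
toy = pd.DataFrame(
    {"subreddit": [1, 0, 1], "processed_text": ["need simple reminder", "", "   "]},
    index=[748, 749, 750],
)

# True where the processed text is empty after stripping whitespace.
is_empty = toy["processed_text"].str.strip().eq("")

# Keep only the rows that still contain text.
filtered = toy.loc[~is_empty]

print(filtered.shape)
```

The same mask applied to `cleaned_corpus` would remove the empty rows without listing their index labels by hand.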

In [62]:
cleaned_corpus.shape

(2796, 2)

In [63]:
cleaned_corpus.to_csv('./data/text_processed_corpus.csv', index = False)