In [1]:
import pandas as pd 
import spacy
nlp = spacy.load('en_core_web_sm')
df = pd.read_csv('../data/books.csv')

The third line loads spaCy's small pretrained English pipeline, en_core_web_sm.

In [2]:
print(df)

      name favorite_genre                                 description
0    Alice        Fantasy          Loves dragons and mysterious lands
1      Bob         Sci-Fi  Enjoys exploring space and futuristic tech
2  Charlie         Horror       Fascinated by ghosts and dark stories


Now I'll run one of the descriptions through spaCy to check that it behaves as I intend.

In [3]:
text = df.loc[0, 'description']  # description in the first row
print(text)

Loves dragons and mysterious lands


In [4]:
doc = nlp(text.lower())

This lowercases the text and runs it through spaCy's NLP pipeline, which returns a Doc object made up of tokens.

In [5]:
tokens = []
for token in doc:
    if not token.is_stop and not token.is_punct:
        tokens.append(token.lemma_)
print(tokens)

['love', 'dragon', 'mysterious', 'land']
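
To double-check why each token was kept or dropped, it can help to lay out the flags the filter reads. This is a minimal sketch, not part of the original notebook: explain_tokens and print_token_table are hypothetical helper names, and they assume doc is the spaCy Doc built in the cell above (any iterable of objects with text, lemma_, is_stop and is_punct attributes would work).

```python
def explain_tokens(doc):
    """Return one row per token: (text, lemma, is_stop, is_punct, kept).

    Assumption: `doc` is a spaCy Doc, or any iterable of objects exposing
    the same four token attributes. This helper is illustrative only.
    """
    rows = []
    for token in doc:
        # Same rule as the filtering loop: keep tokens that are neither
        # stop words nor punctuation.
        kept = not token.is_stop and not token.is_punct
        rows.append((token.text, token.lemma_, token.is_stop, token.is_punct, kept))
    return rows


def print_token_table(doc):
    """Print the rows from explain_tokens as an aligned table."""
    print(f"{'text':<12}{'lemma':<12}{'stop':<7}{'punct':<7}kept")
    for text, lemma, is_stop, is_punct, kept in explain_tokens(doc):
        print(f"{text:<12}{lemma:<12}{is_stop!s:<7}{is_punct!s:<7}{kept}")
```

In the notebook this would be called as print_token_table(doc) right after the cell that creates doc.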


I removed stop words (such as "and") and punctuation, applied lemmatization (e.g. running -> run), and collected the lemmas of the remaining words as tokens.
Now I'll do the same for all the descriptions.

In [7]:
def clean_description(text):
    """Return the lemmas of the text's tokens, dropping stop words and punctuation."""
    doc = nlp(text.lower())
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

df['cleaned_tokens'] = df['description'].apply(clean_description)
print(df[['description','cleaned_tokens']])

                                  description  \
0          Loves dragons and mysterious lands   
1  Enjoys exploring space and futuristic tech   
2       Fascinated by ghosts and dark stories   

                              cleaned_tokens  
0           [love, dragon, mysterious, land]  
1  [enjoy, explore, space, futuristic, tech]  
2            [fascinate, ghost, dark, story]  
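
For a larger table, calling nlp once per row through apply can get slow. spaCy pipelines also offer a pipe method that processes texts in batches, so a possible variant is sketched here. It assumes nlp is the pipeline loaded in the first cell; clean_descriptions and the batch_size default of 64 are illustrative choices of mine, not something the notebook above used.

```python
def clean_descriptions(texts, nlp, batch_size=64):
    """Lemmatize and filter many texts in one pass using nlp.pipe.

    Assumptions: `nlp` is a loaded spaCy pipeline and `texts` is an
    iterable of strings with no missing values. The function name and
    the batch_size default are hypothetical.
    """
    # Lowercase lazily so the texts are streamed into the pipeline.
    lowered = (text.lower() for text in texts)
    cleaned = []
    for doc in nlp.pipe(lowered, batch_size=batch_size):
        # Same filter as clean_description: drop stop words and punctuation.
        cleaned.append(
            [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
        )
    return cleaned
```

With the objects from this notebook, the call would look like df['cleaned_tokens'] = clean_descriptions(df['description'], nlp), which should fill the same column as the apply version.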
