# NLP using spaCy

When working with spaCy, the first step is to pass a text string to an `nlp` object. This object is a pipeline of text-processing components, such as the tokenizer, tagger, parser, and NER, that run in sequence. The input string passes through every component, and the result is a processed `Doc` object that we can work with.

![](spacy_pipeline.png)

## Data: Small sample of Trump's tweets

In [None]:
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import spacy 

path = Path.cwd()

In [None]:
tweets = pd.read_csv(path / "djt_tweets_small.csv", usecols=['text', 'source'])
print(tweets.shape)
tweets.head()

In [None]:
nlp = spacy.load("en_core_web_sm")

## WORD TOKENIZE

In [None]:
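# add doc col: run every tweet through the pipeline and store the resulting Doc
tweets['doc'] = list(nlp.pipe(tweets['text']))

In [None]:
# original tweet (index 0 is an arbitrary choice)
print(tweets['text'][0])

In [None]:
# tokenized tweet
print([token.text for token in tweets['doc'][0]])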
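
Tokenization on its own does not need a trained model. The sketch below uses a blank English pipeline (tokenizer only) and a made-up sentence, both of which are assumptions for illustration rather than part of the tweet data.

```python
import spacy

# A blank pipeline contains only the rule-based English tokenizer
nlp_blank = spacy.blank("en")

text = "Let's make America great again!"  # made-up sample sentence
doc = nlp_blank(text)

# Each Token keeps its text and any trailing whitespace,
# so the original string can always be rebuilt from the tokens
tokens = [token.text for token in doc]
rebuilt = "".join(token.text_with_ws for token in doc)

print(tokens)
print(rebuilt == text)
```

Because tokenization is non-destructive, punctuation and contractions become separate tokens without losing any of the original characters.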


## LEMMA

Shows the root form (lemma) of a word.

In [None]:
tweets['text'][50]

## POS tags

A part-of-speech (POS) tag identifies the grammatical category of each word, such as noun, verb, or adjective.

In [None]:
print(tweets.text[300])

## NER

Named Entity Recognition (NER) is the process of locating named entities in text (people, organizations, places, dates, and so on) and labeling each one with its type.

In [None]:
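from spacy import displacy

def id_entity(doc):
    '''
    Highlights entity names in a tweet
    '''
    # One possible approach: displacy draws the text with each entity span highlighted and labeled
    displacy.render(doc, style="ent", jupyter=True)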

In [None]:
#0
id_entity(tweets.doc[0])

In [None]:
#240
id_entity(tweets.doc[240])

In [None]:
#450
id_entity(tweets.doc[450])

## Redact names

In [None]:
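def redact_names(doc):
    '''
    Redact person names in a tweet
    '''
    text = doc.text
    # Walk the entities right to left so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            # "[REDACTED]" is an arbitrary placeholder
            text = text[:ent.start_char] + "[REDACTED]" + text[ent.end_char:]
    return text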

In [None]:
redact_names(tweets.doc[240])

## STOP WORDS REMOVAL

In [None]:
print(list(tweets['doc'][0]))

In [None]:
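def remove_stop_words(doc, only_nouns=False):
    '''
    Takes a tweet and removes the stop words and punctuation if only_nouns=False. If only_nouns=True,
    only nouns are kept.
    '''
    if only_nouns:
        # keep tokens tagged as common nouns
        return [token.text for token in doc if token.pos_ == "NOUN"]
    # drop stop words and punctuation, keep everything else
    return [token.text for token in doc if not token.is_stop and not token.is_punct]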

In [None]:
words = remove_stop_words(tweets['doc'][0])
print(words)
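
The stop-word list itself ships with the language defaults, and every token carries `is_stop` and `is_punct` flags. The sketch below uses a blank English pipeline, which is enough for these lexical flags, and a made-up sentence.

```python
import spacy

# The stop-word flags are lexical, so no trained model is needed here
nlp_blank = spacy.blank("en")

# The default English stop-word set
stop_words = nlp_blank.Defaults.stop_words
print(len(stop_words))

doc = nlp_blank("This is a great day for the country!")  # made-up sample

# Inspect the flags, then keep only tokens that are neither stop words nor punctuation
flags = [(token.text, token.is_stop, token.is_punct) for token in doc]
kept = [token.text for token in doc if not token.is_stop and not token.is_punct]

print(flags)
print(kept)
```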

## What does Trump talk about?

One way to explore this is to extract every noun from his tweets and see which ones appear most often.

In [None]:
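# concat all tweets
# (assumes the combined text stays under nlp.max_length, 1,000,000 characters by default)
all_tweets = nlp(" ".join(tweets['text']))

In [None]:
# keep only nouns
nouns = [token.text for token in all_tweets if token.pos_ == "NOUN"]
df_nouns = pd.DataFrame({"Trump Topics": nouns})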

In [None]:
sns.countplot(y="Trump Topics",
             data=df_nouns,
             order=df_nouns["Trump Topics"].value_counts().iloc[:10].index)
plt.show()