# Elementary concepts in Natural Language Processing

## spaCy
spaCy is an open-source Python library for natural language processing (NLP). It provides tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more. It is known for its efficiency and ease of use, which makes it a popular choice for processing and analyzing text.

In [15]:
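import spacy

# Download the small English pipeline (only needed once per environment)
spacy.cli.download("en_core_web_sm")

# Load the pipeline used throughout this notebook
nlp = spacy.load("en_core_web_sm")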

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


The text to be processed

In [2]:
my_sentence = "Hello, I'm a french library"

## Tokenization
Tokenization is the process of breaking down a text or a sequence of characters into smaller units, known as tokens. These tokens can be individual words, sentences, or even subwords, depending on the level of granularity required for the analysis. Tokenization is a fundamental step in many natural language processing (NLP) tasks, as it provides a structured representation of textual data that can be further analyzed or processed.

In [17]:
def return_token_text(sentence):
    # Divide the sentence into tokens
    doc = nlp(sentence)
    # Return the text of the tokens
    return [token.text for token in doc]

In [18]:
my_tok_text = return_token_text(my_sentence)
print("Tokenizing...")
print(my_tok_text)

Tokenizing...
['Hello', ',', 'I', "'m", 'a', 'french', 'library']
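
To make the idea of granularity concrete, here is a minimal sketch of word-level and sentence-level tokenization using only regular expressions. It is a simplification and not spaCy's tokenizer: the function names and patterns are illustrative assumptions, and unlike spaCy this version keeps a contraction such as "I'm" as a single token.

```python
import re

def simple_word_tokenize(text):
    # A word (optionally with one internal apostrophe) or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def simple_sentence_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Hello, I'm a french library. Tokens can be words!"
words = simple_word_tokenize(sample)
sentences = simple_sentence_tokenize(sample)

print(words)      # word-level tokens, punctuation kept as separate tokens
print(sentences)  # sentence-level tokens
```

A rule-based pattern like this breaks down quickly on abbreviations, URLs, or hyphenated words, which is one reason to prefer a library tokenizer for real text.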


## Stopwords
Stopwords refer to commonly used words in a language that are often considered insignificant for analysis purposes. These words, such as "the," "is," "and," "in," etc., occur frequently in text but typically do not carry much meaning or contribute to the understanding of the context. In NLP, stopwords are often removed from text data before further analysis or processing, as they can add noise and hinder the performance of certain tasks, such as text classification or information retrieval.

In [19]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# French stopword list; the sample sentence is English, so its tokens are unlikely to match
stopWords = set(stopwords.words('french'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Ignoring the stopwords

In [20]:
def remove_stopwords(sentence):
    # Keep only the tokens that are not in the stopword list
    return [token for token in return_token_text(sentence) if token not in stopWords]


In [21]:
clean_words = remove_stopwords(my_sentence)
print("Removing stopwords...")
print(clean_words)

Removing stopwords...
['Hello', ',', 'I', "'m", 'a', 'french', 'library']
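
The output above is identical to the tokenized sentence: the stopword list is French while the sentence is English, and the membership test is case-sensitive. The sketch below shows the same filtering step with a lowercase comparison. The `DEMO_STOPWORDS` set and the `filter_stopwords` name are assumptions made for illustration; this is a tiny hand-picked set, not NLTK's English list.

```python
# Tiny hand-picked English stopword set, for illustration only (not NLTK's list).
DEMO_STOPWORDS = {"i", "a", "the", "is", "and", "in", "'m"}

def filter_stopwords(tokens, stopwords=DEMO_STOPWORDS):
    # Compare in lowercase so that "I" matches "i"; return a new list.
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ["Hello", ",", "I", "'m", "a", "french", "library"]
filtered = filter_stopwords(tokens)
print(filtered)  # ['Hello', ',', 'french', 'library']
```

Punctuation is kept here because it is not in the stopword set; whether to drop it is a separate preprocessing decision.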


## Stemming
Stemming is a technique used in NLP to reduce words to a base form, known as a "stem," by stripping suffixes and inflectional endings according to simple rules. The purpose of stemming is to normalize words and reduce dimensionality for tasks such as information retrieval, search, and text mining. Because it relies on surface patterns rather than a dictionary, stemming can produce non-dictionary stems (for example, "troubl") and cannot resolve irregular forms; mapping those to a dictionary form is the job of lemmatization.

For example:
* Running, Runs -> Run (regular suffixes are stripped)
* Going -> Go, but Went and Gone are not reduced to Go by a stemmer
* Am, Are, Is -> Be requires lemmatization, not stemming
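
As a rough illustration of rule-based suffix stripping, here is a toy stemmer. The `SUFFIXES` tuple and `toy_stem` function are hypothetical and far simpler than the Snowball, Porter, or Lancaster algorithms; the sketch only shows that regular endings are removed by rule while irregular forms are left as they are.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least two characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run"), except for l, s and z.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "lsz":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "runs", "ran", "going", "went", "troubled"]:
    print(w, "->", toy_stem(w))
```

Here "running" and "runs" both become "run" and "troubled" becomes the non-dictionary stem "troubl", while "ran" and "went" come back unchanged because no suffix rule applies to them.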


In [None]:
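import nltk
# With no arguments this opens NLTK's interactive downloader; the stemmers below need no extra data
nltk.download()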

### Snowball stemming

In [22]:
from nltk.stem.snowball import SnowballStemmer
# Applies French suffix rules; use language='english' when stemming English text
stemmer = SnowballStemmer(language='french')

In [26]:
def return_stem(sentence):
    doc = nlp(sentence)
    return [stemmer.stem(s.text) for s in doc]


In [29]:
my_stem = return_stem(my_sentence)
print("Stemming")
print(my_stem)

Stemming
['hello', ',', 'i', "'m", 'the', 'french', 'librarian', 'gazing', 'a', 'the', 'star', 'abov', 'the', 'ringing']


### Porter Stemming

In [3]:
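from nltk.stem import PorterStemmer
porter = PorterStemmer()

print("Porter Stemmer")
# Assumed input: the recorded output shows "cat", but the original call was not preserved
print(porter.stem("cats"))
print(porter.stem("trouble"))
print(porter.stem("troubling"))
print(porter.stem("troubled"))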

Porter Stemmer
cat
troubl
troubl
troubl


### Lancaster Stemming

In [13]:
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()

my_sentence = "Daring talking to my darling this waying?"

# split() only breaks on whitespace, so "waying?" keeps its question mark
my_sentence = my_sentence.split()
# Stem each word with the Lancaster stemmer
my_stem = [lancaster.stem(word) for word in my_sentence]

print("Lancaster Stemmer")
print("My sentence:", my_sentence)
print("Stemming...")
print("My stem:", my_stem)

Lancaster Stemmer
My sentence: ['Daring', 'talking', 'to', 'my', 'darling', 'this', 'waying?']
Stemming...
My stem: ['dar', 'talk', 'to', 'my', 'darl', 'thi', 'waying?']
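
In the output above the last token still carries its question mark, because `str.split()` separates on whitespace only. One possible fix, sketched below with a hypothetical helper named `strip_punctuation`, is to trim leading and trailing punctuation before handing the words to a stemmer.

```python
import string

def strip_punctuation(words):
    # Remove leading/trailing punctuation and drop tokens that become empty.
    cleaned = (w.strip(string.punctuation) for w in words)
    return [w for w in cleaned if w]

raw_words = "Daring talking to my darling this waying?".split()
clean_tokens = strip_punctuation(raw_words)
print(clean_tokens)
# ['Daring', 'talking', 'to', 'my', 'darling', 'this', 'waying']
```

The cleaned list can then be passed word by word to `lancaster.stem`. Using the spaCy tokenizer from earlier in the notebook would be an alternative, since it already splits punctuation into separate tokens.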


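## Bag of Words
A bag-of-words model represents each document as a vector of word counts over a shared vocabulary, ignoring word order.

In [5]: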
from sklearn.feature_extraction.text import CountVectorizer

text = ['Hello my name is is is Becaye',
        'Becaye this is my python notebook',
        'Becaye Becaye trying to create a big dataset',
        'Becaye of words to try different',
        'features of count vectorizer']

# Learn the vocabulary and build the document-term count matrix
count_vect = CountVectorizer()
count_matrix = count_vect.fit_transform(text)

In [6]:
# Vocabulary terms, in column order
print(count_vect.get_feature_names_out())

['becaye' 'big' 'count' 'create' 'dataset' 'different' 'features' 'hello'
 'is' 'my' 'name' 'notebook' 'of' 'python' 'this' 'to' 'try' 'trying'
 'vectorizer' 'words']


In [7]:
print(count_matrix)

  (0, 7)	1
  (0, 9)	1
  (0, 10)	1
  (0, 8)	3
  (0, 0)	1
  (1, 9)	1
  (1, 8)	1
  (1, 0)	1
  (1, 14)	1
  (1, 13)	1
  (1, 11)	1
  (2, 0)	2
  (2, 17)	1
  (2, 15)	1
  (2, 3)	1
  (2, 1)	1
  (2, 4)	1
  (3, 0)	1
  (3, 15)	1
  (3, 12)	1
  (3, 19)	1
  (3, 16)	1
  (3, 5)	1
  (4, 12)	1
  (4, 6)	1
  (4, 2)	1
  (4, 18)	1
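
Each line of the sparse output above reads as `(document index, vocabulary index)` followed by a count. The sketch below decodes the first five entries using the vocabulary printed earlier; the `describe` helper is a hypothetical name introduced only for this illustration.

```python
# Vocabulary as printed by get_feature_names_out() above, in column order.
vocabulary = ['becaye', 'big', 'count', 'create', 'dataset', 'different',
              'features', 'hello', 'is', 'my', 'name', 'notebook', 'of',
              'python', 'this', 'to', 'try', 'trying', 'vectorizer', 'words']

# First five entries of the printed sparse matrix: ((document, column), count).
entries = [((0, 7), 1), ((0, 9), 1), ((0, 10), 1), ((0, 8), 3), ((0, 0), 1)]

def describe(entries, vocabulary):
    # Translate each column index back into its vocabulary term.
    return [f"document {doc}: '{vocabulary[col]}' x {count}"
            for (doc, col), count in entries]

described = describe(entries, vocabulary)
for line in described:
    print(line)
```

For instance, the entry `(0, 8)  3` says that the first sentence contains the word at column 8, "is", three times.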


<img src="bag of words.png"/>

### Converting the bag of words to an array

In [8]:
count_array = count_matrix.toarray()
print(count_array)

[[1 0 0 0 0 0 0 1 3 1 1 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0]
 [2 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0]
 [1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 1]
 [0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0]]
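
To show where these numbers come from, the following pure-Python sketch rebuilds the same matrix with `collections.Counter`. It assumes `CountVectorizer`'s default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the `tokenize` and `bag_of_words` names are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = ['Hello my name is is is Becaye',
          'Becaye this is my python notebook',
          'Becaye Becaye trying to create a big dataset',
          'Becaye of words to try different',
          'features of count vectorizer']

# Tokens of at least two word characters, as in CountVectorizer's default pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN.findall(doc.lower())

def bag_of_words(docs):
    counts = [Counter(tokenize(doc)) for doc in docs]
    # Shared vocabulary: every distinct token, sorted alphabetically.
    vocab = sorted(set().union(*counts))
    # One row per document, one column per vocabulary term.
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

vocab, rows = bag_of_words(corpus)
print(vocab)
for row in rows:
    print(row)
```

The single-character word "a" in the third sentence is dropped by the token pattern, which is why it has no column in the matrix.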


#### Now you are ready to dive into the Sentiment analysis notebook. 