```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.2 Simple topic identification
```

# Introduction to gensim

## What is gensim?

* It is a popular open-source NLP library.

* It builds on well-known academic models (for example word2vec and LDA) to perform complex tasks:

	* building document or word vectors

	* performing topic identification and document comparison

## What is a word vector?

![What is a word vector](ref2.%20What%20is%20a%20word%20vector.jpg)
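
As a rough illustration of the idea, a word vector can be treated as a plain list of numbers, and two words can be compared by the cosine of the angle between their vectors. The three-dimensional vectors below are made-up values chosen for illustration, not output from gensim or any trained model, and `cosine_similarity` is a hypothetical helper written for this sketch.

```python
import math

# Hypothetical 3-dimensional word vectors (made-up values, not from a trained model).
vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in similar directions score closer to 1.
sim_royal = cosine_similarity(vectors["king"], vectors["queen"])
sim_fruit = cosine_similarity(vectors["king"], vectors["apple"])
print(round(sim_royal, 3), round(sim_fruit, 3))
```

With these toy values, the two related words score higher than the unrelated pair, which is the property real word vectors are trained to have.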

## Text preprocessing in Python:

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

In [2]:
from nltk.corpus import stopwords

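text = """The cat is in the box. The cat likes the box.
The box is over the cat."""
# Lowercase, tokenize, and keep alphabetic tokens only
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Build the stop-word set once instead of reloading the list for every token
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in english_stops]
Counter(no_stops).most_common(2)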

[('cat', 3), ('box', 3)]
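
Both words occur three times in the output above, and `cat` is listed first. In Python 3.7 and later, `Counter.most_common` lists elements with equal counts in the order they were first encountered, so the order of tied words follows the order of the tokens. A minimal standard-library sketch of this behavior (the token lists are typed by hand to mirror the filtered tokens, so NLTK is not needed to run it):

```python
from collections import Counter

# Hand-typed stand-in for the filtered tokens of the example text.
tokens = ["cat", "box", "cat", "likes", "box", "box", "cat"]
top_two = Counter(tokens).most_common(2)

# Same counts, but "box" is encountered first, so it is listed first.
box_first = Counter(["box"] * 3 + ["cat"] * 3).most_common(2)

print(top_two)
print(box_first)
```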

In [3]:
from nltk.stem import WordNetLemmatizer

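text = """Cats, dogs and birds are common pets. So are fish."""
# Lowercase, tokenize, and keep alphabetic tokens only
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Build the stop-word set once, then filter
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in english_stops]
# Reduce each token to its dictionary form (lemmatize() treats words as nouns by default)
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
print(lemmatized)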

['cat', 'dog', 'bird', 'common', 'pet', 'fish']


## Practice question for text preprocessing steps:

* Which of the following are useful text preprocessing steps?
    
    $\Box$ Stemming, spelling correction, lowercasing.
    
    $\boxtimes$ Lemmatization, lowercasing, removing unwanted tokens.

    $\Box$ Removing stop words, leaving in capital words.
    
    $\Box$ Strip stop words, word endings and digits.
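
To see why lowercasing and removing unwanted tokens matter, the sketch below counts the same text with and without those two steps. It uses only the standard library: the regex tokenizer and the four-word stop list are simplified, hypothetical stand-ins for NLTK's `word_tokenize` and its stop-word corpus, and lemmatization is left out because it needs WordNet.

```python
import re
from collections import Counter

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Simplified stand-ins (assumptions for this sketch, not NLTK's behavior):
# a regex that splits words from punctuation, and a tiny hand-picked stop list.
raw_tokens = re.findall(r"\w+|[^\w\s]", text)
stop_words = {"the", "is", "in", "over"}

# Without preprocessing, "The", "the" and "." all compete with the content words.
raw_counts = Counter(raw_tokens)

# Lowercase, keep alphabetic tokens only, and drop stop words.
clean_counts = Counter(
    t.lower() for t in raw_tokens
    if t.isalpha() and t.lower() not in stop_words
)

print(raw_counts.most_common(3))
print(clean_counts.most_common(3))
```

In the raw counts, capitalized and lowercase forms of the same word are counted separately and punctuation ranks alongside real words; after the cleanup only the content words remain.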

## Practice exercises for simple text preprocessing:

$\blacktriangleright$ **Package pre-loading:**

In [4]:
from nltk import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

$\blacktriangleright$ **Data pre-loading:**

In [5]:
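# Read the article with a context manager so the file is closed afterwards
with open('ref1. Wikipedia article - Debugging.txt') as f:
    article = f.read()
tokens = word_tokenize(article)
lower_tokens = [t.lower() for t in tokens]
# A set makes the membership test in the stop-word filter faster than a list
english_stops = set(stopwords.words('english'))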

$\blacktriangleright$ **Text preprocessing practice:**

In [6]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('debugging', 40), ('system', 25), ('bug', 17), ('software', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('debugger', 13)]
