## NLP - Text Parsing, Stemming, Stopword removal, Term Frequency Matrix

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import re  # regular expressions
import string
import warnings

import pandas as pd
# import os

import nltk  # Natural Language Toolkit
from nltk.corpus import stopwords
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases, so keep it

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abdukarimabdusalomov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/abdukarimabdusalomov/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/abdukarimabdusalomov/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [5]:
warnings.filterwarnings('ignore')

In [6]:
text="/@@@111Faculty      of Economic Sciences,,,, as an independent unit of the University of Warsaw, affirms its commitment to basic goals and values specified in the Mission Statement of the University of Warsaw. In regard to the way in which the mission of our Alma Mater refers to the discipline represented by the Faculty of Economic Sciences, we define the following goal and value as our priorities of special importance: unity of research and teaching is the foundation of the activities at the Faculty of Economic Sciences."
print(text)

/@@@111Faculty      of Economic Sciences,,,, as an independent unit of the University of Warsaw, affirms its commitment to basic goals and values specified in the Mission Statement of the University of Warsaw. In regard to the way in which the mission of our Alma Mater refers to the discipline represented by the Faculty of Economic Sciences, we define the following goal and value as our priorities of special importance: unity of research and teaching is the foundation of the activities at the Faculty of Economic Sciences.


## Text Parsing

## Preliminary Cleaning

In [7]:
# remove special characters from the text:
# "/", "@", "," and similar are deleted; letters, digits, spaces, newlines and periods are kept.

text_clean = re.sub(r'[^a-zA-Z0-9 \n.]', '', text)
print(text_clean)

111Faculty      of Economic Sciences as an independent unit of the University of Warsaw affirms its commitment to basic goals and values specified in the Mission Statement of the University of Warsaw. In regard to the way in which the mission of our Alma Mater refers to the discipline represented by the Faculty of Economic Sciences we define the following goal and value as our priorities of special importance unity of research and teaching is the foundation of the activities at the Faculty of Economic Sciences.


## Cleaning Text

#### a) to remove unnecessary spaces, punctuation and numbers

In [8]:
# remove unnecessary spaces
text_cleaner = re.sub(' +', ' ', text_clean)

In [9]:
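# remove punctuation - mostly done above; this alternative pattern drops everything
# that is not a word character or whitespace (including periods).
# Note: it runs on the raw text and the result is only displayed, not stored.
re.sub(r'[^\w\s]', '', text)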

'111Faculty      of Economic Sciences as an independent unit of the University of Warsaw affirms its commitment to basic goals and values specified in the Mission Statement of the University of Warsaw In regard to the way in which the mission of our Alma Mater refers to the discipline represented by the Faculty of Economic Sciences we define the following goal and value as our priorities of special importance unity of research and teaching is the foundation of the activities at the Faculty of Economic Sciences'

In [10]:
# remove digits
text_cleaner = re.sub(r'\d', '', text_cleaner)
print(text_cleaner)

Faculty of Economic Sciences as an independent unit of the University of Warsaw affirms its commitment to basic goals and values specified in the Mission Statement of the University of Warsaw. In regard to the way in which the mission of our Alma Mater refers to the discipline represented by the Faculty of Economic Sciences we define the following goal and value as our priorities of special importance unity of research and teaching is the foundation of the activities at the Faculty of Economic Sciences.


#### b) change letters to lower case

In [11]:
# change to lowercase (printed only; text_cleaner itself keeps its original case)
print(text_cleaner.lower())

faculty of economic sciences as an independent unit of the university of warsaw affirms its commitment to basic goals and values specified in the mission statement of the university of warsaw. in regard to the way in which the mission of our alma mater refers to the discipline represented by the faculty of economic sciences we define the following goal and value as our priorities of special importance unity of research and teaching is the foundation of the activities at the faculty of economic sciences.
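The cleaning steps above can be chained into one helper. This is a minimal sketch; `clean_text` is a hypothetical name that does not appear in the notebook, and it also trims leading and trailing spaces, which the cells above do not do.

```python
import re


def clean_text(raw: str) -> str:
    """Apply the cleaning steps from this section in sequence (illustrative helper)."""
    # keep only letters, digits, spaces, newlines and periods
    cleaned = re.sub(r'[^a-zA-Z0-9 \n.]', '', raw)
    # drop digits
    cleaned = re.sub(r'\d', '', cleaned)
    # collapse runs of spaces and trim both ends
    cleaned = re.sub(r' +', ' ', cleaned).strip()
    # lowercase the result
    return cleaned.lower()


sample = "/@@@111Faculty      of Economic Sciences,,,, as an independent unit."
print(clean_text(sample))
```

Unlike the cell above, the helper returns the lowercased string, so later steps would operate on lowercase text.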


## Stopword removal

- The NLTK "stopwords" corpus ships word lists for many languages, including Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish.
- Language names are passed in lowercase, e.g. 'english'.

In [12]:
# remove English stopwords
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text_cleaner)

# keep only tokens that are not in the stopword set (the comparison is case-sensitive)
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print()
print(filtered_sentence)

['Faculty', 'of', 'Economic', 'Sciences', 'as', 'an', 'independent', 'unit', 'of', 'the', 'University', 'of', 'Warsaw', 'affirms', 'its', 'commitment', 'to', 'basic', 'goals', 'and', 'values', 'specified', 'in', 'the', 'Mission', 'Statement', 'of', 'the', 'University', 'of', 'Warsaw', '.', 'In', 'regard', 'to', 'the', 'way', 'in', 'which', 'the', 'mission', 'of', 'our', 'Alma', 'Mater', 'refers', 'to', 'the', 'discipline', 'represented', 'by', 'the', 'Faculty', 'of', 'Economic', 'Sciences', 'we', 'define', 'the', 'following', 'goal', 'and', 'value', 'as', 'our', 'priorities', 'of', 'special', 'importance', 'unity', 'of', 'research', 'and', 'teaching', 'is', 'the', 'foundation', 'of', 'the', 'activities', 'at', 'the', 'Faculty', 'of', 'Economic', 'Sciences', '.']

['Faculty', 'Economic', 'Sciences', 'independent', 'unit', 'University', 'Warsaw', 'affirms', 'commitment', 'basic', 'goals', 'values', 'specified', 'Mission', 'Statement', 'University', 'Warsaw', '.', 'In', 'regard', 'way', '
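Note that 'In' survives in the filtered output because the tokens keep their original case while the English stopword list is lowercase. A minimal sketch of a case-insensitive filter follows; `demo_stop_words` is a small hand-written stand-in for `stopwords.words('english')`, used only so the example runs without downloading NLTK data.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
demo_stop_words = {"in", "of", "the", "to", "and"}

tokens = ["In", "regard", "to", "the", "way", "of", "the", "Faculty"]

# Compare the lowercased token so that 'In' and 'in' are treated alike,
# while the surviving tokens keep their original spelling.
filtered = [t for t in tokens if t.lower() not in demo_stop_words]
print(filtered)  # ['regard', 'way', 'Faculty']
```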

In [13]:
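# if necessary, remove your own stopwords, given as a list of words:
stop_words_lst = ['a']

filtered_text = text_cleaner
for w in stop_words_lst:
    # \b matches whole words only; re.escape guards against regex metacharacters
    pattern = r'\b' + re.escape(w) + r'\b'
    # apply each removal to the running result so that every word in the list is removed
    filtered_text = re.sub(pattern, '', filtered_text)

print(filtered_text)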

Faculty of Economic Sciences as an independent unit of the University of Warsaw affirms its commitment to basic goals and values specified in the Mission Statement of the University of Warsaw. In regard to the way in which the mission of our Alma Mater refers to the discipline represented by the Faculty of Economic Sciences we define the following goal and value as our priorities of special importance unity of research and teaching is the foundation of the activities at the Faculty of Economic Sciences.
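With a longer custom list, one alternation pattern can remove all words in a single pass. The sketch below assumes a hypothetical helper `remove_words`; it also collapses the double spaces that the removals leave behind.

```python
import re


def remove_words(text, words):
    """Remove whole-word occurrences of every entry in `words` (illustrative helper)."""
    # Build one pattern such as \b(?:a|the)\b from the escaped words.
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    without = re.sub(pattern, '', text)
    # Removing a word leaves its surrounding spaces, so collapse and trim them.
    return re.sub(r' +', ' ', without).strip()


print(remove_words("a unit of the University", ["a", "the"]))
```

The match is case-sensitive, as in the loop above; passing `flags=re.IGNORECASE` to the first `re.sub` would change that.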


## Stemming

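- Stemming strips suffixes to reduce words to a common root form (stem).
- The idea is to map related forms such as "move", "moved" and "movement" to a shared core "move"; the exact result depends on the stemmer.
- Stems produced by the rule-based Porter stemmer need not be dictionary words.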
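To illustrate the idea only, here is a deliberately naive suffix stripper. It is not the Porter algorithm; `toy_stem` and its suffix list are hypothetical and chosen just so that the three example words collapse to one stem.

```python
def toy_stem(word: str) -> str:
    """Naive suffix stripping for illustration; NOT the Porter algorithm."""
    word = word.lower()
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in ("ement", "ment", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Drop a trailing 'e' so that "move" and "moved" share a stem.
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


for w in ["move", "moved", "movement"]:
    print(w, ":", toy_stem(w))
```

All three words map to "mov" here, which shows that a stem is a grouping key and not necessarily a real word.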

In [14]:
# stem document
ps = PorterStemmer()

words = word_tokenize(text_cleaner)

for w in words:
    print(w, " : ", ps.stem(w))

Faculty  :  faculti
of  :  of
Economic  :  econom
Sciences  :  scienc
as  :  as
an  :  an
independent  :  independ
unit  :  unit
of  :  of
the  :  the
University  :  univers
of  :  of
Warsaw  :  warsaw
affirms  :  affirm
its  :  it
commitment  :  commit
to  :  to
basic  :  basic
goals  :  goal
and  :  and
values  :  valu
specified  :  specifi
in  :  in
the  :  the
Mission  :  mission
Statement  :  statement
of  :  of
the  :  the
University  :  univers
of  :  of
Warsaw  :  warsaw
.  :  .
In  :  in
regard  :  regard
to  :  to
the  :  the
way  :  way
in  :  in
which  :  which
the  :  the
mission  :  mission
of  :  of
our  :  our
Alma  :  alma
Mater  :  mater
refers  :  refer
to  :  to
the  :  the
discipline  :  disciplin
represented  :  repres
by  :  by
the  :  the
Faculty  :  faculti
of  :  of
Economic  :  econom
Sciences  :  scienc
we  :  we
define  :  defin
the  :  the
following  :  follow
goal  :  goal
and  :  and
value  :  valu
as  :  as
our  :  our
priorities  :  prioriti
of  :  of


## Term Frequency Matrix

In [15]:
wordlist = text_cleaner.split()

wordfreq = []
for w in wordlist:
    wordfreq.append(wordlist.count(w))

In [16]:
print("String\n" + text_cleaner +"\n")
print("List\n" + str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))

String
Faculty of Economic Sciences as an independent unit of the University of Warsaw affirms its commitment to basic goals and values specified in the Mission Statement of the University of Warsaw. In regard to the way in which the mission of our Alma Mater refers to the discipline represented by the Faculty of Economic Sciences we define the following goal and value as our priorities of special importance unity of research and teaching is the foundation of the activities at the Faculty of Economic Sciences.

List
['Faculty', 'of', 'Economic', 'Sciences', 'as', 'an', 'independent', 'unit', 'of', 'the', 'University', 'of', 'Warsaw', 'affirms', 'its', 'commitment', 'to', 'basic', 'goals', 'and', 'values', 'specified', 'in', 'the', 'Mission', 'Statement', 'of', 'the', 'University', 'of', 'Warsaw.', 'In', 'regard', 'to', 'the', 'way', 'in', 'which', 'the', 'mission', 'of', 'our', 'Alma', 'Mater', 'refers', 'to', 'the', 'discipline', 'represented', 'by', 'the', 'Faculty', 'of', 'Economic'
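Two remarks on the counts above: calling `wordlist.count(w)` inside the loop rescans the whole list for every word, and splitting on whitespace leaves tokens such as 'Warsaw.' with the period attached. A minimal sketch of a one-pass alternative with `collections.Counter`, extended to a small document-by-term matrix, follows. The two-document corpus `docs` is hypothetical, since the notebook works with a single text.

```python
from collections import Counter

# Hypothetical mini-corpus for illustration only.
docs = {
    "doc1": "faculty of economic sciences",
    "doc2": "university of warsaw faculty of economic sciences",
}

# One Counter per document: a single pass instead of list.count() per word.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The shared, sorted vocabulary defines the matrix columns.
vocab = sorted(set().union(*counts.values()))

# Rows are documents, columns are terms; a Counter returns 0 for missing terms.
tf_matrix = {name: [c[term] for term in vocab] for name, c in counts.items()}

print(vocab)
for name, row in tf_matrix.items():
    print(name, row)
```

If a labelled table is preferred, the same dictionary could be passed to `pd.DataFrame.from_dict(tf_matrix, orient='index', columns=vocab)`.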

---

Zipf plot (word frequency against frequency rank) - for better visualization
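The data behind a Zipf plot is a rank-frequency table: words sorted by descending count, with rank 1 for the most frequent. A minimal sketch that builds this table with the standard library is shown below; `rank_frequency` is a hypothetical helper name, and the plotting step itself is left out.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


sample_tokens = "of the of faculty of the".split()
for rank, word, freq in rank_frequency(sample_tokens):
    print(rank, word, freq)
```

Plotting frequency against rank on log-log axes (for example with matplotlib, which the notebook does not import) would then give the usual Zipf view.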

---