# Latent Semantic Analysis:
## _Data Mining for Meaningful Concepts In Christianity Newsgroups_
---

Prepared By: Jason Schenck  
Date: February 6th 2017  
CSC-570 Data Science Essentials


<br>
<big>Table Of Contents</big>

---
* **[1 Introduction][Introduction]**
   * [1.1][1.1] _Purpose & Data Source_
   * [1.2][1.2] _What is a "Latent Semantic Analysis"?_
   * [1.3][1.3] _Terminology Defined_
   * [1.4][1.4] _Process/Procedure & Methodology_


* **[2 Data Preparation][Data Preparation]**
   * [2.1][2.1] _Data Retrieval_
   * [2.2][2.2] _Data Inspection_
   * [2.3][2.3] _Defining 'stopwords'_


* **[3 Latent Semantic Analysis (LSA)][Latent Semantic Analysis (LSA)]**
   * [3.1][3.1] _TF-IDF Vectorization_
   * [3.2][3.2] _SVD Modeling with Scikit-Learn_


* **[4 Results: Interpration Of Extracted Concepts][Results: Interpration Of Extracted Concepts]**



     
[Introduction]: #1-Introduction
[1.1]: #1.1-Purpose-&-Data-Source
[1.2]: #1.2-What-is-a-"Latent-Semantic-Analysis"?
[1.3]: #1.3-Terminology-Defined
[1.4]: #1.4-Process/Procedure-&-Methodology
[Data Preparation]: #2-Data-Preparation
[2.1]: #2.1-Data-Retrieval
[2.2]: #2.2-Data-Inspection
[2.3]: #2.3-Defining-'stopwords'
[Latent Semantic Analysis (LSA)]: #3-Latent-Semantic-Analysis-(LSA)
[3.1]: #3.1-TF-IDF-Vectorization
[3.2]: #3.2-SVD-Modeling-with-Scikit-Learn
[Results: Interpration Of Extracted Concepts]: #4-Results:-Interpration-Of-Extracted-Concepts


<br>

<html>
<div class="alert alert-success">
<b>Data Source</b><a href="http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#">"Twenty Newsgroups", Provided By:<b> Scikit-Learn</b></b></a>
</div>
</html>
---

### 1 Introduction

#### 1.1 Purpose & Data Source

In this analysis I will be performing data mining in an effort to extract a series of meaningful and significant concepts from a public dataset of newsgroup postings on the topic of Christianity.

The dataset, titled "Twenty Newsgroups" and is officially described as follows:
>"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."

A newsgroup is an online public forum for discussion on a particular topic. The topic that I will be extracting data from will be "Christianity" (_soc.religion.christian_). I'm very curious to see what the results of this analysis will be, and in concluding intend to share my opinion on them. 

#### 1.2 What is a "Latent Semantic Analysis"?

_Latent Semantic Analysis (LSA)_ is a technique commonly used in the field of Natural Language Processing (NLP). As a computer scientist, when performing NLP we are concerned with the interactions that that exist between computers and human language. A great portion of this field focuses on the analysis of the relationship between multiple words in a document of text contained in a collection of related documents. This is known as the subfield of _Natural Language Understanding_ and can be thought of more simply as "teaching computers how to read". 

LSA is more formally defined by [_"An Introduction to Latent Semantic Analysis" by Landauer, Foltz, & Laham_](http://lsa.colorado.edu/papers/dp1.LSAintro.pdf)
>"Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the
contextual-usage meaning of words by statistical computations applied to a large corpus of
text (Landauer and Dumais, 1997). The underlying idea is that the aggregate of all the word
contexts in which a given word does and does not appear provides a set of mutual
constraints that largely determines the similarity of meaning of words and sets of words to
each other."

#### 1.3 Terminology Defined

There is a vast list of new terminoloy defined by the field of NLP. Below I will briefly define those of significance to LSA which will be used extensively throughout this analysis.

* **Bag Of Words (BOW)** - An abstraction model in NLP where we consider each document of text to simply be a "bag of words" in the literal sense, such that grammar and conceptual meaning is ignored.
* **Term Frequency–Inverse Document Frequency (TF-IDF)** - A mathematical calculation for scoring the importance of a word in a document or a collection. This score value is based on _Zipf's Law_ of power distributions.
* **Term** - A single word found in a document of text.
* **Document** - A single collection of terms. Defined by the LSA study. In this case, each discussion post by a user will be a document.
* **Corpus** - A single collection of related documents.
* **Concept** - The final output of an LSA is a list of concepts. These are words, or multiple words together, which were found to have the highest significance across our corpus. They are called concepts, because they represent a meaningful 'conceptualization' that has been extracted from the corpus.

#### 1.4 Process/Procedure & Methodology

In brief, I will summarize a list of 7 steps representing the overall process required to perform an LSA:

1. Collect/Retrieve a dataset containing text of interest. 
2. Define which text in the dataset will be represented as documents (sentences, discussion board poasts, news articles, ?)
3. Using the BOW model, parse by document and store words in a BOW where each bag is a document. Ending result should be a collection of documents of terms.
4. Clean the data by removing any non-alphanumeric characters such as HTML or XML tagging. Next, remove words that have very high frequency of repetition across the corpus, but with little to no significance. Due to the nature of 'TF-IDF' which relies on the _inverse_ frequency of significant terms across the corpus, this part of the process is not a straightforward one. Instead, by trial and error remove words with caution and sparingly, then re-test the model. This means steps 1-7 are completed, however you then must test and repeat this step possibly several times until the desired output is achieved. 
5. Perform TF-IDF Vectorization. This scores the words as terms for each document and across the corpus.
6. Matrix decomposition using the SVD algorithm.
7. Output a list of concepts extracted. 

Now we can begin our prepartions for LSA, starting with step 1, importing the dataset.

### 2 Data Preparation

#### 2.1 Data Retrieval

In [2]:
# Imports, and dataset download via sk-learn
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import math
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import re


posts = open('ESV.dat', 'r').read()
soup = BeautifulSoup(posts, 'lxml')

In [5]:
# Filter the soup for only the values found between the <text> tags, rename the variable for ease of reading.
postTxt = soup.decode_contents

#### 2.2 Data Inspection

In [6]:
postTxt

<bound method Tag.decode_contents of <html><body><p>{
    "Judges": {
        "11": {
            "24": "Will you not possess what Chemosh your god gives you to possess? And all that the LORD our God has dispossessed before us, we will possess.", 
            "25": "Now are you any better than Balak the son of Zippor, king of Moab? Did he ever contend against Israel, or did he ever go to war with them?", 
            "26": "While Israel lived in Heshbon and its villages, and in Aroer and its villages, and in all the cities that are on the banks of the Arnon, 300 years, why did you not deliver them within that time?", 
            "27": "I therefore have not sinned against you, and you do me wrong by making war on me. The LORD, the Judge, decide this day between the people of Israel and the people of Ammon.\"", 
            "20": "but Sihon did not trust Israel to pass through his territory, so Sihon gathered all his people together and encamped at Jahaz and fought with Israel.", 
     

In [2]:
# Check how many documents (forum posts) are in the dataset
len(corpus)

997

In [3]:
# Check the first document
corpus[0]

'From: sciysg@nusunix1.nus.sg (Yung Shing Gene)\nSubject: Mission Aviation Fellowship\nOrganization: National University of Singapore\nLines: 3\n\nHi,\n\tDoes anyone know anything about this group and what they\ndo? Any info would be appreciated. Thanks!\n'

In [4]:
 #Uncomment this block to inspect a sample of 10 documents *
# Print the first 10 documents to inspect the data
for x in range(0,12):
    print(corpus[x])


From: sciysg@nusunix1.nus.sg (Yung Shing Gene)
Subject: Mission Aviation Fellowship
Organization: National University of Singapore
Lines: 3

Hi,
	Does anyone know anything about this group and what they
do? Any info would be appreciated. Thanks!

From: whitsebd@nextwork.rose-hulman.edu (Bryan Whitsell)
Subject: Re: Satan and TV
Reply-To: whitsebd@nextwork.rose-hulman.edu
Organization: News Service at Rose-Hulman
Lines: 14

In article <May.9.05.41.06.1993.27543@athos.rutgers.edu>  
salaris@niblick.ecn.purdue.edu (Rrrrrrrrrrrrrrrabbits) writes:
> MTV controls what bands are popular, no matter how bad they are.  In fact, it is  
>better to be politically correct - like U2, Madonna - than to have any musical  
>talent. 
> Steven C. Salaris                
 
Interesting idea.  
Regular televeision seems to do this sort of thing too with politically correct  
shows.


In Christ's Love
Bryan 

From: whitsebd@nextwork.rose-hulman.edu (Bryan Whitsell)
Subject: Re: "Accepting Jesus in your heart

**Observations:**  
It appears our data is in plain text with no tagging. However, each post starts with a heading which I've noticed is also variable across the corpus. For example some posts start with a header containing "From", "Subject", and "Organization" while others do not. The following headers are present across the corpus:  
* **From:** [ _email@emailaddress.com_ ]
* **Subject:** [ _topic_ ]
* **Reply-To:** [ _email@emailaddress.com_ ]
* **Organization:** [ _Organization Name_ ]
* **Lines:** [ _# Lines of post_ ]

Also, it appears that a post can be from either a public individual or a member of an organization. In either case, posts can also be both new posts or replies to other's posts. Every header ends with "Lines:" which tells us the number of lines of text contained in the post message itself.


Post content looks like it could be problemsome for LSA if I don't carefully define the stopset of exclusion words. I found that this part of the process consisted of stopset defining and repetitive model testing in order to fine-tune the results.   

One thing that I know we are going to want to exclude regardless are e-mail addresses because these items appear across the entire corpus, and therefore will decrease model performance.

In [5]:
# Using regex, find and remove all e-mail addresses in all documents across the entire corpus
corpus = [re.sub(r'(\s)(\S+\@\S+)(\s)', r'\1\3', corpus[x]) for x in range(len(corpus))]

# Check it
print(corpus[6])

From: 
Subject: tongues (read me!)
Lines: 8

Persons interested in the tongues question are are invited to
peruse an essay of mine, obtainable by sending the message
 GET TONGUES NOTRANS
 to  or to
    

 Yours,
 James Kiefer



In [6]:
# Convert all text to lower-case
postDocs = [x.lower() for x in corpus]

# Check it
print(postDocs[6])

from: 
subject: tongues (read me!)
lines: 8

persons interested in the tongues question are are invited to
peruse an essay of mine, obtainable by sending the message
 get tongues notrans
 to  or to
    

 yours,
 james kiefer



#### 2.3 Defining 'stopwords'

Now that I have removed all of the email addresses and formatted the text to lower-case, I will define the stopset. 

A 'stopset' is a list of 'stopwords' which will be excluded from analysis automatically by scikit-learn's vectorization algorithm. For this LSA, I'm going to use a combination of two pre-built lists for the first attempt: a stopset provided by _Natural Language Toolkit(NTLK)_, and one that I found online called the _Terrier stopset_.

In order to combine these two, we store them in a 'set' datastructure and perform a 'union' between them removing duplicates. 

In [7]:
# Import NTLK stopset
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jasonschenck/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
# Use this cell to add new exclusion words to the stopset before and/or after model testing.
# Note: Most of the words below were added over the course of numerous output testing efforts.

stopset = set(stopwords.words('english'))



stopset.update(['mercury','san','christiansen','dozier','athens','josh','0001','jose','lois','perry','department','editorial','etc','0358','542','706','30602','nasa','langley','subject','elizabeth'
                'phone','bell','nova','gmi','khan0095','budd','28','bud','nj','wkuvx1','bitnet','easteee','holt','gatech','carol','howard','len','hampton','va','cs','terrance','acad1','sahs','uth','randerso','larc','gov','whitesbsd','nextwork','trol','eeap','apr',
                'r2d2','vbv','n4tmi','wbt','wycliffe','ata','hfsi','uk','fidonet','jeff','fenholt','indiana','fisher','microsystems','creps','alvin','netcom','andrew','fil','revdak','jr','velasco',
                'virgilio','ac','za','hayesstw','risc1','ucs','lee','nicholas','mandock','randal','overacker','larry','bernard','elizabeth','dean','seanna','unisa','rose','bryan','bnr','jayne','heath','scott','llo','acs','vela','atterlep',
                'lines','petch','carlson','caralv','university','georgia','aisun3','reply-to','organization','hulman','hayes','steve','mcovingt','ai','ca','covington','bigelow','eugene','tek','gvg47','chuck','gvg','com','uga','bernadette','rutgers',
                'edu','quot','spacing','text','line','none','sans','line','title','word', 'neue','johnsd2','rpi','mls','panix','ebay','group','freenet','carleton','ncr','cso','uxa','uiuc','bjorn','elsegundoca','mit','koberg','gt7122b','oo','la','microsoft','kuhub','cc','ukans',
                'fnal','marka','csd','sapienza','lady','posting','rolfe','joe','jon','tom','fred','ling','siew','wee','matt5','lest','bill','wager','oakland','rochester','alan','steele','therefore','todd','aaron','bryce','a888','sledd','stan','pretoria','392','commentary',
                'cox','paz','vic','fax','713','703','3729','827','murray','dale','gary','reply','mail','gerry','tx','shall','245','shell','box','univ','aa888','traer','bruce','__','___','601','22102','708','632','trei','eggert','amateur','radio','company','houston','lincoln','408',
                '241','9760','02173','617','244','st','203','617','981','2575','subject','really','number','quite','loisc','article','baker','ashley','sj','see'])

# Potential bible verse references, originally added and then removed from stopset
# '44','31','10','11','31','14','21'

In [9]:
# Import the Terrier stopset from file, union with existing stopset
terrierstopset = open('terrierstopset.txt', 'r').read()
stopset = set(stopset).union(set(terrierstopset))

### 3 Latent Semantic Analysis (LSA)

#### 3.1 TF-IDF Vectorization

During this process, I'll be using the TfidfVectorizer() function from the scikit-learn library. This is the part of the LSA that actually converts the words of text that we have collected in to numerical representations by assigning them TF-IDF scores. 
> _ The TF-IDF score of a word 'w' is:_  
> 
> $$tf(w) * idf(w)$$
>
> _where: $$tf(w) =\frac{\text{number of times a word appears in the doc}}{\text{total number of words in the doc}}$$_ 
>
> and : $$idf(w)=  \left\{log\frac{\text{number of documents}}{\text{number of documents that contain the word w}}\right\}$$

When we vectorize, we are essentially defining a lexical analyzer that is built into scikit-learn and therefore must specify some important parameters:  

* **stopwords:** set the param to var stopset  
<br>
* **use idf:** always set to true for LSA  
<br>
* **ngram range:** 'grams' are words, and the ngram_range specifies to the analyzer the minimum(1) to the maximum(N) grams to consider for contextual relationships. I originally started this analysis with ngram_range=(1,3), however found through serveral rounds of testing and fine-tuning that (2,5) tends to produce the most optimal results.

In [10]:
# Define the vectorizer model
vectorizer = TfidfVectorizer(stop_words=stopset, use_idf=True, ngram_range=(2, 4),smooth_idf=True)

# Fit the corpus data
X = vectorizer.fit_transform(postDocs)

In [11]:
# Tada! This is now the output of the first document in the corpus, in sparse IDF matrix form.
print(X[0])

  (0, 420714)	0.158640819949
  (0, 338335)	0.158640819949
  (0, 152859)	0.158640819949
  (0, 243857)	0.158640819949
  (0, 39327)	0.158640819949
  (0, 141944)	0.158640819949
  (0, 251236)	0.138487062613
  (0, 343183)	0.158640819949
  (0, 175992)	0.143395091067
  (0, 28432)	0.115840647931
  (0, 205145)	0.131086376814
  (0, 28878)	0.158640819949
  (0, 189799)	0.158640819949
  (0, 412636)	0.138487062613
  (0, 31117)	0.138487062613
  (0, 420715)	0.158640819949
  (0, 338336)	0.158640819949
  (0, 152860)	0.158640819949
  (0, 243858)	0.158640819949
  (0, 39328)	0.158640819949
  (0, 141945)	0.158640819949
  (0, 251245)	0.158640819949
  (0, 343184)	0.158640819949
  (0, 175993)	0.143395091067
  (0, 28435)	0.149722640257
  (0, 205149)	0.158640819949
  (0, 28879)	0.158640819949
  (0, 189800)	0.158640819949
  (0, 412639)	0.143395091067
  (0, 420716)	0.158640819949
  (0, 338337)	0.158640819949
  (0, 152861)	0.158640819949
  (0, 243859)	0.158640819949
  (0, 39329)	0.158640819949
  (0, 141946)	0.158640

In [12]:
# The current shape is (documents, terms)
X.shape

(997, 420957)

#### 3.2 SVD Modeling with Scikit-Learn

**Single Value Decomposition** (SVD) is the process of taking our corpus of matrices (X), and performing _matrix decomposition_ such that:

<big>$$X \approx USV^{T}$$</big>

where...

* **X** = Original corpus matrix
* **m** = Number of documents contained in X
* **n** = Number of terms
<br>

 
**_X is decomposed into three matricies called U, S, and T with k-value such that..._**  



>* **k** = Number of concepts we want to mine for
>
>
>* **U** = An {'_m x k_'} matrix.  
>  * _Rows_ = Documents
>  * _Columns_ = Concepts
>* **S** = A {'_k x k_'} diagonal matrix. 
>  * _Elements_ =  Variation captured from each concept.
>* **V** = An {'_n x k_'} matrix.
>  * _Rows_ = Terms
>  * _Columns_ = Concepts
>

This is an advanced mathematical procedure involving linear algebra which will decompose our matrix X into three U,S,& V. The entire process is built-in to scikit-learn as an engine model, all we must do is define the model specifications and let it do the work for us. 

[**scikit-learn**](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) provides the following documentation on this function:  
> "Dimensionality reduction using truncated SVD (aka LSA).
This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently.
In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
This estimator supports two algorithms: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient."

In [13]:
# Defining the TruncatedSVD model

# Params: n_components=100 for LSA per sk-learn doc, n_iter=5 (default, and should be adjusted during testing) 
lsa = TruncatedSVD(n_components=100, n_iter=5)

# Fit the model
lsa.fit(X)


TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
       random_state=None, tol=0.0)

In [14]:
# After decomposition, 'lsa.components_[]' represents matrix V'
lsa.components_[0]

array([  5.45641047e-06,   5.45641047e-06,   5.45641047e-06, ...,
         2.02826051e-05,   2.02826051e-05,   2.02826051e-05])

In [15]:
import sys
print (sys.version)

3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]


In [16]:
# Convert the SVD results from numerical representation, back to their appropriate word text form.
# Iterates over the enumeration of matrix components, for each: zips the terms to components, sorts them, then prints. 
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print("Concept %d:" % i )
    for term in sortedTerms:
        print(term[0])
    print (" ")

Concept 0:
priest priest
four years
immaculate conception
answer priest
told priest
years old
case doctrine
apparition deemed
apparition deemed true
apparition deemed true sealed
 
Concept 1:
secretary interior
married god
god eyes
appointee james
appointee james watt
appointee james watt pentacostal
christian think
christian think secretary
christian think secretary interior
days would last
 
Concept 2:
grass valley
daily verse grass
daily verse grass valley
verse grass
verse grass valley
verse grass valley grass
grass valley grass
grass valley grass valley
valley grass
valley grass valley
 
Concept 3:
married god
god eyes
married god eyes
two people
people married god
two people married god
people married
two people married
become married god
become married god eyes
 
Concept 4:
eternal death
hate sin
original sin
christians hell
atheists hell
commands us
atheists believe
since bible
bible problem
bible problem view
 
Concept 5:
hate sin
commands us
love sinner
sin love
hate sin love

### 4 Results: Interpration Of Extracted Concepts

**Observations**  
In order to produce the above output, it took several attempts of fine-tuning the stopset list, vectorization parameters, and SVD parameters then re-running the model. During this process I found that stopset word selection can be tricky, because only those terms which repeat the most across the entire corpus should be excluded. If one unknowingly removes a term which is sparsely found in the corpus, then the efficiency of the model is reduced negatively, impacting both performance and the output of concepts. 

After trying a handful of different variations, I found the following parameters produce the most meaningful extraction of concepts:  
- TfidfVectorizer(stop_words=stopset,use_idf=True, **ngram_range=(2, 4**)
- TruncatedSVD(n_components=100, **n_iter=5**)

Other configurations tested include ngram_range(1,3), (2,2), (2,3), (2,5), (3,3), and (1,4). For ngrams < 2 the results lacked substance and returned only very simple concepts such as: God, sin, hate, and love. As ngrams_range was adjusted the resulting concepts became much more intricate and meaningful. I also ran a few different configurations with different values for n_iter (epochs), and noticed that this significantly affected the runtime efficiency of the model for any values ~n_iter > 30. I tested n_iter=100, while it took well over 3 minutes to complete execution, the resulting concepts did not appear to have improved much, if at all. 

The exclusion words were updated several times as well with each test ran, and mostly what I found was that removing certain terms, about 4-5 at a time, then re-testing the model proved successful in the long-run. Specifically the output concepts were checked for terms which appeared out of place, and just 'odd', and then added to the stopset.

An important observation made, was that of certain numbers that repeated as concept output. This was super tricky to filter for, as some were extremely significant actually representing bible verses that fit perfectly to the concept (ex: lev 18 22), while others were junk such as the following three numbers:'706','542','0358', which is actually the telephone number for the A.I. department at Georgia Tech! (_If you see a number produced as part of a concept, Google that number to find the bible verse. It proves to be very significant._)  

**Interesting Findings**  
Christianity is a topic that I am not personally very familiar with, which is in part why I chose it for this study. I wanted to see if I could extract concepts that were very clear to even an observer who is unknowledgeable on  the topic such as myself. 

I performed some research on a few of the more interesting concepts and ended up with some pretty awesome discoveries:

- Ideological Manipulation ([Wikipedia](https://en.wikipedia.org/wiki/Dominant_ideology)):
>  "Social control exercised and effected by means of the _ideological manipulation_ of aspects of the common culture of a society — religion and politics, culture and economy, etc. — to explain and justify the status quo to the political advantage of the dominant (ruling) class..."  

- James G. Watt ([Wikipedia](https://en.wikipedia.org/wiki/James_G._Watt)):
> "James Gaius Watt (born January 31, 1938) served as U.S. Secretary of the Interior from 1981 to 1983. Often described as "anti-environmentalist", he was one of Ronald Reagan's most controversial cabinet appointments."
>
> "In 1995, Watt was indicted on 25 counts of felony perjury and obstruction of justice by a federal grand jury, accused of making false statements before the grand jury investigating influence peddling at the Department of Housing and Urban Development, which he had lobbied in the 1980s"

- Speaking In Tongues ([Wikipedia](https://en.wikipedia.org/wiki/Glossolalia)):
> "Glossolalia or speaking in tongues, according to linguists, is the fluid vocalizing of speech-like syllables that lack any readily comprehended meaning, in some cases as part of religious practice in which it is believed to be a divine language unknown to the speaker."