<div style = "text-align:center">
<h1> Feature Extraction Tutorial </h1>
<h3> Keyword extraction has six main seteps </h3>
<img src="./steps.jpeg" width = "600px" height= "500px">
</div>

<h2> We first need to import the needed packages </h2>
<p>This includes: </p>
<ol>
    <li> <b>nltk.corpus.stopwords</b> : This package allows us to get curated stopwords for many different languages </li>
    <li> <b> nltk.stem.wordnet.WordNetLemmatizer </b>: This package is used for the lemmatization process. The WordNetLemmatizer uses the Princeton University WordNet database to get the root of words. WordNet has many other uses as well. More information 
        <a href="https://pythonprogramming.net/wordnet-nltk-tutorial/"> here </a> </li>
    <li> <b> re </b>: a regular expression package for python and my prefered choice for text extraction, and cleaning. NLTK also has really utilities for text cleaning as well. More information on what regular expressions are can be found <a href = "https://www.regular-expressions.info/" > here </a></li>
    <li> <b> sklearn.feature_extraction.text.CountVectorizer </b>: This package is used to generate the frequency matrix. More information can be found 
 <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">here </a> </li>
    <li> <b>pandas</b>: Since we'll be using a csv of comments today, one of the best way to manipulate this data is by using a pandas dataframe constructed from this csv. An awesome tutorial on pandas can be found <a href = "https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python"></a></li>
    
</ol>
    

In [19]:
#If this is the first time importing the stopwords package /
#you must first download the stopwords using nltk.download('stopwords')
import re
import pandas
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


<h2> Importing our data </h2>
<p> I have provided a very popular dataset called the NIPS papers you can find more information about this dataset <a href= "https://www.kaggle.com/benhamner/nips-papers/home"> </a>

In [20]:
#here we construct a dataframe from the papers.csv in the nips-paper folder
nips_papers = pandas.read_csv("nips-papers/papers.csv")
#lets look at some of the data
print("Number of rows : " + str(nips_papers.shape[0]))
nips_papers.head()
#Its clear we need to do some cleaning, we want only rows which have abstracts

Number of rows : 2999


Unnamed: 0.1,Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
1,2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
2,3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
3,4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."
4,5,1002,1994,Using a neural net to instantiate a deformable...,,1002-using-a-neural-net-to-instantiate-a-defor...,Abstract Missing,U sing a neural net to instantiate a\ndeformab...


In [21]:
#we select the rows only when the column "abstract" is not equal to the default value 
nips_papers = nips_papers[nips_papers['abstract'] != "Abstract Missing"]
print("Number of rows : " + str(nips_papers.shape[0]))
nips_papers.head()

Number of rows : 413


Unnamed: 0.1,Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
940,941,1861,2000,Algorithms for Non-negative Matrix Factorization,,1861-algorithms-for-non-negative-matrix-factor...,Non-negative matrix factorization (NMF) has pr...,Algorithms for Non-negative Matrix\nFactorizat...
1066,1067,1975,2001,Characterizing Neural Gain Control using Spike...,,1975-characterizing-neural-gain-control-using-...,Spike-triggered averaging techniques are effec...,Characterizing neural gain control using\nspik...
2383,2384,3163,2007,Competition Adds Complexity,,3163-competition-adds-complexity.pdf,It is known that determinining whether a DEC-P...,Competition adds complexity\n\nJudy Goldsmith\...
2384,2385,3164,2007,Efficient Principled Learning of Thin Junction...,,3164-efficient-principled-learning-of-thin-jun...,We present the first truly polynomial algorith...,Efficient Principled Learning of Thin Junction...
2387,2388,3167,2007,Regularized Boost for Semi-Supervised Learning,,3167-regularized-boost-for-semi-supervised-lea...,Semi-supervised inductive learning concerns ho...,Regularized Boost for Semi-Supervised Learning...


In [22]:
#now lets remove all but the data we want
abstract_dataset = pandas.DataFrame(columns=['abstract', 'word_count'])
abstract_dataset['abstract'] = nips_papers['abstract']
abstract_dataset.reset_index(inplace = True)
abstract_dataset['word_count'] = abstract_dataset['abstract'].apply(lambda x: len(x.split(" ")))
total_wordcount = abstract_dataset['word_count'].sum()
print("sum \t" + str(total_wordcount))
print(abstract_dataset.word_count.describe())
abstract_dataset.head()

sum 	57452
count    413.000000
mean     139.108959
std       44.134286
min       19.000000
25%      107.000000
50%      133.000000
75%      168.000000
max      290.000000
Name: word_count, dtype: float64


Unnamed: 0,index,abstract,word_count
0,940,Non-negative matrix factorization (NMF) has pr...,107
1,1066,Spike-triggered averaging techniques are effec...,81
2,2383,It is known that determinining whether a DEC-P...,67
3,2384,We present the first truly polynomial algorith...,143
4,2387,Semi-supervised inductive learning concerns ho...,119


<img src = "./toomuch.jpg">

<h2> Looking at our data</h2>
<p> lets look at the most common words without cleaning our data </p>

In [23]:
abstract_raw_word_counts = pandas.Series(''.join(abstract_dataset['abstract']).split()).value_counts()

In [24]:
#we can see we can't really get useful informative words the way that the data currently is
abstract_raw_word_counts

the                     2905
of                      2196
a                       1677
and                     1293
to                      1272
in                       994
is                       804
that                     795
for                      747
on                       515
We                       487
we                       460
with                     416
as                       367
are                      363
this                     357
an                       350
learning                 309
be                       305
by                       305
can                      302
which                    299
from                     265
model                    253
The                      241
data                     239
In                       226
show                     211
algorithm                206
our                      190
                        ... 
MB.                        1
($p$                       1
smoother                   1
Performance   

<h2> Cleaning our data </h2>
<p> To clean our data we need to do the following </p>
<ul>
    <li> remove punctuation and special symbols </li>
    <li> remove tags </li>
    <li> remove digits </li>
    <li> make all characters lower case </li>
<ul>

In [25]:
#we loop through all our rows
corpus = ""
for abstract in abstract_dataset['abstract']:
    #remove all punctuation
    abstract = re.sub('[^a-zA-Z]', ' ', abstract)
    #empty tags 
    abstract = re.sub("&lt;/?.*?&gt;", " &lt;&gt; ", abstract)
    #remove special characters and digits
    abstract = re.sub("(\\d|\\W)+"," ", abstract)
    #append to corpus
    corpus = " ".join((corpus, abstract))
corpus[:1000]  

' Non negative matrix factorization NMF has previously been shown to be a useful decomposition for multivariate data Two different multi plicative algorithms for NMF are analyzed They differ only slightly in the multiplicative factor used in the update rules One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback Leibler divergence The monotonic convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the Expectation Maximization algorithm The algorithms can also be interpreted as diag onally rescaled gradient descent where the rescaling factor is optimally chosen to ensure convergence  Spike triggered averaging techniques are effective for linear characterization of neural responses But neurons exhibit important nonlinear behaviors such as gain control that are not captured by such analyses We describe a spike triggered covariance method for retriev

<h2>Normalization and Lemmatization </h2>
<p> Here we remove all stopwords and get the lemmas for our words</p>

In [26]:
corpus = []
#initialize the lemmatizer
lem = WordNetLemmatizer()
#create a set of stopwords
stop_words = set(stopwords.words("english"))
for abstract in abstract_dataset['abstract']:
    #remove all punctuation
    abstract = re.sub('[^a-zA-Z]', ' ', abstract)
    #empty tags 
    abstract = re.sub("&lt;/?.*?&gt;", " &lt;&gt; ", abstract)
    #remove special characters and digits
    abstract = re.sub("(\\d|\\W)+"," ", abstract)
    #make all chars lowercase
    abstract.lower()
    #create a list of words 
    words = abstract.split()
    text = " ".join([lem.lemmatize(word) for word in words if \
                    word not in stop_words])
    corpus.append(text)
# we have now generated a normalized corpus with only word lemmas and 
corpus[:2]


['Non negative matrix factorization NMF previously shown useful decomposition multivariate data Two different multi plicative algorithm NMF analyzed They differ slightly multiplicative factor used update rule One algorithm shown minimize conventional least square error minimizes generalized Kullback Leibler divergence The monotonic convergence algorithm proven using auxiliary func tion analogous used proving convergence Expectation Maximization algorithm The algorithm also interpreted diag onally rescaled gradient descent rescaling factor optimally chosen ensure convergence',
 'Spike triggered averaging technique effective linear characterization neural response But neuron exhibit important nonlinear behavior gain control captured analysis We describe spike triggered covariance method retrieving suppressive component gain control signal neuron We demonstrate method simulation retinal ganglion cell data Analysis physiological data reveals significant suppressive ax explains neural nonli

<h2> Create the count vector </h2>

In [27]:
cv = CountVectorizer(stop_words= stop_words, max_features = 10000, ngram_range= (1,1))

In [28]:
#load the corpus into the CountVectorizer
cv = cv.fit(corpus)
frequency_matrix = cv.transform(corpus)
frequency_matrix

<413x4534 sparse matrix of type '<class 'numpy.int64'>'
	with 26681 stored elements in Compressed Sparse Row format>

In [29]:
sum_words = frequency_matrix.sum(axis = 0)

In [30]:
word_frequencies = [(word, sum_words[0,idx]) for word, idx in cv.vocabulary_.items()]

In [31]:
word_frequencies = sorted(word_frequencies, key = lambda x: x[1], reverse = True)
word_frequencies[:20]

[('model', 485),
 ('algorithm', 413),
 ('learning', 398),
 ('data', 356),
 ('method', 338),
 ('problem', 304),
 ('show', 234),
 ('based', 234),
 ('approach', 206),
 ('function', 185),
 ('using', 167),
 ('result', 164),
 ('paper', 149),
 ('task', 142),
 ('image', 141),
 ('set', 134),
 ('feature', 133),
 ('kernel', 130),
 ('present', 129),
 ('new', 126)]