[back](./00-index.ipynb)

---
## `NLP Data Cleaning`

### `Imports and setup`

In [1]:
import numpy as np
from collections import Counter
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import SnowballStemmer
import string
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline


In [2]:
nltk.download('stopwords')
nltk.download('punkt')
stops = set(nltk.corpus.stopwords.words('english'))


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/goutham/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/goutham/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
corpus = ["Jeff stole my octopus sandwich.",
          "'Help!' I sobbed, sandwichlessly.",
          "'Drop the sandwiches!' said the sandwich police."]


### `How do I turn a corpus of documents into a feature matrix?`
### `Words --> numbers?????`

**Corpus:** A collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

```
[
  "Jeff stole my octopus sandwich.",
  "'Help!' I sobbed, sandwichlessly.",
  "'Drop the sandwiches!' said the sandwich police."
]
```

`NLTK` is a widely used Python library for text pre-processing; it also ships some modelling components.

We'll use it primarily for pre-processing, and `scikit-learn` (`sklearn`) for model fitting, evaluation, etc.

Now, let's inspect the `stopwords` we obtained from `nltk`:

In [4]:
print(stops)

{'at', 'not', 'than', 'me', 'an', 'because', 'through', 'more', 'no', 'few', 'you', 'here', 'or', 'weren', 'doesn', 'have', 'most', 'for', "won't", 'same', 'ourselves', 'its', 'theirs', 'our', "shan't", 'won', 'before', 'is', "shouldn't", "she's", 'from', 'does', 'so', 'ma', 'under', 'should', 'just', 'very', 'aren', 'hasn', 'she', 'where', 'shouldn', "couldn't", 'do', 'as', "don't", "mightn't", 'been', "hasn't", 'with', 'each', 'being', 'has', 're', "weren't", 'their', 'we', "you'd", 'him', "you'll", 'to', 'further', 'some', 'how', 'below', 'nor', 'shan', 'ours', 'by', 'my', 'myself', 'this', 'did', 'these', 'after', 'other', 'them', 'if', "you're", 'hers', 'and', 'that', 'into', 'why', 'yourselves', 'mightn', 'whom', 've', 'be', 'too', 'didn', "should've", "mustn't", 'until', 'about', 'down', 'on', 'while', 'don', 'himself', 'his', 'own', 'themselves', 'your', 'when', "doesn't", 'such', 'the', 'doing', 'but', 's', 'needn', "that'll", 'can', 'her', 'who', 'which', 'of', 'i', 'all', 'a

### `Tokenize our corpus`

In [5]:
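def our_tokenizer(doc, stops=None, stemmer=None):
    # Lowercase the document, then split it into word and punctuation tokens.
    doc = word_tokenize(doc.lower())
    # Strip punctuation characters from inside each token.
    tokens = [''.join([char for char in tok if char not in string.punctuation]) for tok in doc]
    # Drop tokens that were pure punctuation and are now empty strings.
    tokens = [tok for tok in tokens if tok]
    # Optionally remove stop words.
    if stops:
        tokens = [tok for tok in tokens if tok not in stops]
    # Optionally reduce each remaining token to its stem.
    if stemmer:
        tokens = [stemmer.stem(tok) for tok in tokens]
    return tokens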

In [6]:
tokenized_docs = [our_tokenizer(doc) for doc in corpus]
tokenized_docs

[['jeff', 'stole', 'my', 'octopus', 'sandwich'],
 ['help', 'i', 'sobbed', 'sandwichlessly'],
 ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police']]

After running `our_tokenizer()`, the output is a list of lists: the original corpus was a list of sentences, and each sentence is now a list of tokens.

Each document becomes one inner list, and each word in that document becomes one token in it.

Along the way, the tokenizer applies two standard clean-up steps:
- It converts all tokens to a single case (lowercase, in our case).
- It removes punctuation such as `. , !`.
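
The two clean-up steps above can be sketched in isolation with the standard library only. The `raw_tokens` list below is hand-written to resemble tokenizer output; it is an assumption for illustration, not actual `word_tokenize` output, and `strip_punctuation` is a hypothetical helper name.

```python
import string

# Hand-written stand-in for tokenizer output (an assumption, not real NLTK output).
raw_tokens = ["'Help", "!", "'", "I", "sobbed", ",", "sandwichlessly", "."]

def strip_punctuation(tok):
    """Remove every ASCII punctuation character from a single token."""
    return "".join(ch for ch in tok if ch not in string.punctuation)

# Step 1 and 2: lowercase each token, then strip punctuation from it.
cleaned = [strip_punctuation(tok.lower()) for tok in raw_tokens]

# Tokens that were pure punctuation are now empty strings, so filter them out.
tokens = [tok for tok in cleaned if tok]

print(cleaned)
print(tokens)
```

Stripping punctuation leaves empty strings behind wherever a token consisted only of punctuation, which is why the extra filtering pass is needed.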

In [7]:
'i' in stops

True

### `Remove stop words`

In [8]:
tokenized_docs

[['jeff', 'stole', 'my', 'octopus', 'sandwich'],
 ['help', 'i', 'sobbed', 'sandwichlessly'],
 ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police']]

In [9]:
tokenized_docs = [our_tokenizer(doc, stops=stops) for doc in corpus]
tokenized_docs

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['help', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwiches', 'said', 'sandwich', 'police']]
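
As a rough illustration of what the stop-word step does, the filter can be reproduced without NLTK. The `small_stops` set below is a hypothetical three-word subset chosen for this corpus (all three words appear in the NLTK list printed earlier); the real list is much longer.

```python
# Token lists as they look before stop-word removal.
docs = [
    ['jeff', 'stole', 'my', 'octopus', 'sandwich'],
    ['help', 'i', 'sobbed', 'sandwichlessly'],
    ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police'],
]

# Hypothetical mini stop list; a set gives fast membership checks.
small_stops = {'my', 'i', 'the'}

# Keep only the tokens that are not stop words, document by document.
filtered = [[tok for tok in doc if tok not in small_stops] for doc in docs]

for doc in filtered:
    print(doc)
```

Stop words are dropped because they occur in almost every document and so carry little information for telling documents apart.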

### `Stemming / Lemmatization`

In [10]:
tokenized_docs

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['help', 'sobbed', 'sandwichlessly'],
 ['drop', 'sandwiches', 'said', 'sandwich', 'police']]

In [11]:
tokenized_docs = [our_tokenizer(doc, stops=stops, stemmer=SnowballStemmer('english')) for doc in corpus]
tokenized_docs

[['jeff', 'stole', 'octopus', 'sandwich'],
 ['help', 'sob', 'sandwichless'],
 ['drop', 'sandwich', 'said', 'sandwich', 'polic']]
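
One way to see why stemming helps is to count tokens before and after it. The sketch below assumes `nltk` is installed and reuses the third document from above; the `Counter` comparison is added here for illustration.

```python
from collections import Counter

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# Third document after stop-word removal, before stemming.
doc = ['drop', 'sandwiches', 'said', 'sandwich', 'police']
stems = [stemmer.stem(tok) for tok in doc]

# Count each distinct token before and after stemming.
before = Counter(doc)
after = Counter(stems)

print(stems)
print(before['sandwich'], after['sandwich'])
```

After stemming, `sandwiches` and `sandwich` collapse into one token, so they will count as the same feature. Note that a stem such as `polic` need not be a dictionary word; it only has to be consistent across inflected forms.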

### `Conclusion`


---
[next](./02-count-vectorization-and-tfidf.ipynb)