# Cleaning textual data
Popular Python libraries for cleaning textual data include **NLTK, re, sklearn, pandas**.

Structured data: Data organized into a pre-defined structure, such as a database table with rows and columns.  
Unstructured data: Data without a pre-defined structure. E.g.: emails, a collection of satellite images, the text of speeches.

Converting unstructured data into structured form:
1. Bag of Words: A text representation that records which words occur in a document and how often, disregarding their order, to simplify text analysis.  
e.g: A Bayesian spam filter uses the bag of words model to classify emails, analyzing word frequency patterns to filter spam. For more on Naive Bayes: https://www.youtube.com/watch?v=O2L2Uv9pdDA

2. N-grams: An extension of the bag of words model. An n-gram is a sequence of n consecutive words, so n-grams capture local word order and provide more context than single words.

3. Semantic Methods: These methods interpret text by understanding language structure and grammar, allowing for deeper contextual insights.  
e.g: Named Entity Recognition (NER): A semantic technique for recognizing and categorizing key entities, such as the names of people, places, and organizations, within text.
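
The difference between the first two representations can be sketched with the standard library alone. The helper names (`tokenize`, `bag_of_words`, `ngrams`) and the two sample sentences are illustrative assumptions, not part of any library; a real project would more likely use something like sklearn's vectorizers.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def bag_of_words(text):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokenize(text))

def ngrams(text, n=2):
    """Return the n-grams of the text as tuples of n consecutive tokens."""
    tokens = tokenize(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

# Both sentences produce the same bag of words...
print(bag_of_words(doc_a) == bag_of_words(doc_b))
# ...but their bigrams differ, because bigrams keep local word order.
print(ngrams(doc_a))
print(ngrams(doc_b))
```

The two sentences mean opposite things, yet a bag of words cannot tell them apart; the bigram lists can.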

Common text preprocessing techniques:
1. Stop-word Removal: This involves eliminating common words that don’t add significant meaning to the text, such as “the”, “and”, and “in”, to focus on more meaningful words.
2. Stemming: This reduces words to their base or root form, treating different word forms as the same entity, which is useful in text analysis.

    **Note : Stemming and lemmatization are both text normalization techniques in Natural Language Processing (NLP) that reduce words to their root form, but they differ in their approach. Stemming is a simpler, rule-based process that often truncates word endings, potentially leading to non-dictionary words. Lemmatization, on the other hand, considers the context and morphological analysis of a word to return its dictionary-based root form (lemma), which is always a valid word.** 


3. Case Conversion: This involves changing all text to a uniform case, either lower or upper, ensuring consistent treatment of words regardless of their original case.
4. Punctuation and White Space Removal: This step removes punctuation and extra white spaces since they do not contribute to the meaning in a bag of words model, preventing inconsistencies.
5. Number Removal: This involves removing numbers when they do not add significant meaning to the text, simplifying the analysis.
6. Word Frequency and Bag of Words Model: This technique represents text data by counting word frequency in a document, aiding in text classification and clustering by providing a simple way to quantify text data.

These techniques aim to clean and prepare text data for analysis by simplifying the text while retaining its core meaning, enhancing the efficiency of natural language processing tasks.
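
The steps above can be chained into a small pipeline. This is a minimal sketch under stated assumptions: the stop-word list is a hypothetical, deliberately tiny one, the input is assumed to be plain English text, and `naive_stem` is a toy suffix stripper written for illustration, not the Porter stemmer or any NLTK function.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "in", "a", "of", "to", "is"}

def naive_stem(word):
    """Toy suffix stripper for illustration only; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                   # case conversion
    text = re.sub(r"[0-9]+", " ", text)                   # number removal
    text = re.sub(r"[^a-z\s]", " ", text)                 # punctuation removal
    tokens = text.split()                                 # split() also collapses extra whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [naive_stem(t) for t in tokens]                # stemming

print(preprocess("The 3 dogs jumped   and barked in the garden!"))
# ['dog', 'jump', 'bark', 'garden']
```

The order of the steps matters: lowercasing first lets the later patterns and the stop-word lookup work on a single case.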

Sentiment analysis involves analyzing text to determine the sentiment it expresses. The main approaches are lexicon-based and machine learning-based techniques. Sentiment analysis can be applied in areas like product and movie reviews and is useful for monitoring customer feedback and market research.

A lexicon is a predefined list of words, each associated with a specific sentiment. In lexicon-based sentiment analysis, each word in a text is replaced with its corresponding sentiment from the lexicon, and these values are combined to summarize the overall sentiment expressed in the text. The effectiveness of this approach depends heavily on the quality of the lexicon, and one of its challenges is handling words whose interpretation changes with context.
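
A lexicon-based scorer can be sketched in a few lines. The lexicon below is a hypothetical toy example with made-up scores (+1 positive, -1 negative), and the function names are illustrative assumptions; published lexicons are far larger.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "boring": -1, "terrible": -1}

def sentiment_score(text):
    """Sum the lexicon scores of the words in the text; unknown words score 0."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "Great cast, and I love the soundtrack.",
    "Boring plot and terrible pacing.",
    "Not bad at all.",
]
for review in reviews:
    score = sentiment_score(review)
    print(review, score, label(score))
```

The third review is labelled negative even though a reader would take it as mild praise, which illustrates the context problem: a word-by-word lookup does not see the negation.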

| Pattern Syntax | Description |
|---------------|-------------|
| `[a-z]` | Matches any lowercase letter |
| `[A-Z]` | Matches any uppercase letter |
| `[0-9]` | Matches any digit |
| `[^0-9]` | Matches any character except digits 0-9 |
| `[^A-Za-z]` | Matches any character except letters |
| `[...]+` | One or more occurrences of any character listed in the brackets |
| `[a-zA-Z0-9]+` | One or more alphanumeric characters |
| `[a-fA-F0-9]+` | One or more hexadecimal digits |
| `\W+` | One or more non-word characters; `\W` is a shorthand character class, equivalent to `[^a-zA-Z0-9_]` in ASCII mode (`re.ASCII`) |
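
A few of these patterns applied with Python's `re` module, as a quick illustration. The sample string is made up for this example.

```python
import re

# Made-up sample string for illustration.
sample = "Order #A3F9 shipped on 2024-05-17 to Jane_Doe!"

digits = re.findall(r"[0-9]+", sample)             # runs of digits
letters_only = re.sub(r"[^A-Za-z]+", " ", sample)  # replace runs of non-letters with a space
alnum = re.findall(r"[a-zA-Z0-9]+", sample)        # runs of alphanumeric characters
parts = re.split(r"\W+", sample)                   # split on runs of non-word characters
is_hex = re.fullmatch(r"[a-fA-F0-9]+", "A3F9") is not None  # whole string is hexadecimal

print(digits)
print(letters_only)
print(alnum)
print(parts)
print(is_hex)
```

Note that `\W` treats the underscore as a word character, so `Jane_Doe` stays in one piece under `re.split`, and that splitting a string which ends in a separator leaves an empty string at the end of the result.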

| String Methods | Description |
|----------------|-------------|
| `input_string.lower()` (or `str.lower(input_string)`) | Returns a lowercase copy of the string |
| `input_string.strip()` (or `str.strip(input_string)`) | Returns a copy with leading/trailing whitespace removed |
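
Both spellings call the same method: `input_string.lower()` calls it on an instance, while `str.lower` refers to it through the class, which is convenient when passing it to `map`. A short sketch with made-up values:

```python
raw = "   Hello, World!  \n"

# Methods return new strings, so calls can be chained.
cleaned = raw.strip().lower()
print(repr(cleaned))  # 'hello, world!'

# The class-level form is handy when applying the method to many strings.
lowered = list(map(str.lower, ["One", "MORNING"]))
print(lowered)  # ['one', 'morning']
```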

In [2]:
```python
# load text
filename = 'data/metamorphosis_clean.txt'
with open(filename, 'r', encoding='utf-8-sig') as file:
    text = file.read()
print(text[:200])
```

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin. He lay on his
armour-like back, and if he lifted his head a little he could se


In [3]:
```python
# split into words by white space
words = text.split()
print(words[:100])
```

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '“What’s', 'happened', 'to', 'me?”', 'he', 'thought.', 'It', 'wasn’t', 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']
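
Splitting on whitespace leaves punctuation attached to tokens such as `'morning,'` and `'me?”'`. One possible follow-up, sketched here on a handful of tokens copied from the output above, is to strip punctuation from the ends of each token. Keeping apostrophes and hyphens inside words is an assumption of this sketch, not a rule.

```python
import string

# A few tokens copied from the whitespace-split output above.
sample_words = ['One', 'morning,', 'when', '“What’s', 'happened', 'to', 'me?”', 'armour-like']

# string.punctuation only covers ASCII, so the curly quotes are added explicitly.
strip_chars = string.punctuation + "“”‘’"

# str.strip(chars) removes those characters from both ends only,
# so the apostrophe in "What’s" and the hyphen in "armour-like" survive.
stripped = [w.strip(strip_chars) for w in sample_words]
print(stripped)
```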


In [4]:
```python
# using re package to split
import re

words = re.split(r'\W+', text)
print(words[:100])
```

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']
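
Splitting on `\W+` removes the punctuation, but it also breaks `What’s` into `'What', 's'`, `wasn’t` into `'wasn', 't'`, and `armour-like` into two tokens. If those should stay whole, one option is to match tokens rather than split on separators. The pattern below is a sketch under the assumption that tokens are ASCII letters optionally joined by apostrophes or hyphens; the sample string is assembled from phrases in the text above.

```python
import re

# Sample assembled from phrases in the text above.
sample = "“What’s happened to me?” he thought. It wasn’t a dream. His armour-like back"

# One or more letters, optionally followed by groups of
# (apostrophe or hyphen) + letters, e.g. wasn’t, armour-like.
TOKEN = re.compile(r"[A-Za-z]+(?:[’'-][A-Za-z]+)*")

tokens = [t.lower() for t in TOKEN.findall(sample)]
print(tokens)
```

Matching with `findall` also avoids the empty strings that `re.split` can leave at the start or end of its result.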
