## Problem Statement 1: Natural Language Processing (NLP)

**Problem:**
Implement a function to preprocess and tokenize text data. 

**Requirements:**
Implement in Python using libraries like NLTK or spaCy.
Handle edge cases such as punctuation, stop words, and different cases. 

**Evaluation Criteria:**
Correctness of the preprocessing steps.
Efficiency and readability of the code.
Clean and structured code with appropriate comments.

In [6]:
# Import spaCy
import spacy

In [7]:
# Load spaCy's large English language model
nlp = spacy.load("en_core_web_lg")

In [8]:
# Function to preprocess and tokenize the input text

def preprocess_and_tokenize(text):
    """
    Take a string, run it through the spaCy pipeline, and return a list of
    lowercased tokens with punctuation and stop words removed.
    """
    
    # Process the text with spaCy
    doc = nlp(text)   # Returns a spaCy Doc object (a sequence of tokens)
    
    # Skip punctuation and stop words; lowercase the remaining tokens
    filtered_tokens = []
    
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.text.lower()) 
    
    return filtered_tokens

In [9]:
# Example usage
sample_text = "Hello, world! This is a sample text for preprocessing and tokenization."

# Storing the result
preprocessed_tokens = preprocess_and_tokenize(sample_text)

# Display the result
print(preprocessed_tokens)

['hello', 'world', 'sample', 'text', 'preprocessing', 'tokenization']


### Interpretation:
#### Punctuation has been removed (",", "!", ".")
#### Stop words have been removed ("This", "is", "a", "for", "and")
#### Text has been lowercased ("Hello" --> "hello")
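
One edge case not covered by the sample sentence is irregular whitespace. spaCy emits runs of extra spaces and newlines as separate whitespace tokens, and these are neither stop words nor punctuation. The sketch below shows one way to skip them as well. It is a hypothetical variant, not part of the solution above: the names `nlp_blank` and `preprocess_skip_whitespace` are invented for illustration, and it assumes a blank English pipeline is sufficient because only lexical attributes are used.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here, since
# is_stop, is_punct and is_space are lexical attributes that need no trained model.
nlp_blank = spacy.blank("en")


def preprocess_skip_whitespace(text):
    """Hypothetical variant: also drop whitespace-only tokens."""
    return [
        token.text.lower()
        for token in nlp_blank(text)
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


# Input with repeated spaces and blank lines
messy_text = "Hello,   world!\n\nThis is a   sample text."
tokens = preprocess_skip_whitespace(messy_text)
print(tokens)
```

The extra `token.is_space` check is the only behavioral difference from the loop above; the list comprehension is just a more compact way of writing the same filter.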

### NOTE: The problem statement does not mention stemming or lemmatization, so the code above does not perform either. In some NLP applications, however, stemming or lemmatization is an important preprocessing step.
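
To make the distinction concrete: stemming chops suffixes off by rule and can produce strings that are not real words, whereas lemmatization maps a word to its dictionary form. The block below is a deliberately naive suffix stripper written only to illustrate this. `toy_stem` and its suffix list are hypothetical; it is not the Porter algorithm or any stemmer shipped with NLTK or spaCy.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters.

    Illustration only: real stemmers apply many more rules than this.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["coding", "boring", "tokens", "text"]]
print(stems)  # ['cod', 'bor', 'token', 'text']
```

Note that "coding" becomes "cod" rather than "code": a rule-based stemmer has no vocabulary to check its output against, which is the main reason lemmatization is often preferred when readable tokens matter.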

In [12]:
def text_preprocess(text):
    """
    Take a string, run it through the spaCy pipeline, and return a list of
    lowercased lemmas with punctuation and stop words removed.
    """
    
    # Process the text with spaCy
    doc = nlp(text)   # Returns a spaCy Doc object (a sequence of tokens)
    
    # Skip punctuation and stop words; lemmatize and lowercase the remaining tokens
    filtered_tokens = []
    
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_.lower()) # Lemmatize
    
    return filtered_tokens

In [14]:
# Example usage
text = "Coding is never boring."

# Storing the result
preprocessed_tokens = text_preprocess(text)

# Display the result
print(preprocessed_tokens)

['code', 'boring']
