# Overview of Text Preprocessing Techniques

This Jupyter notebook gives a concise overview of essential text preprocessing techniques in Python: regular expression matching, punctuation removal, stopword filtering, stemming, and tokenization. The examples target social media text such as tweets.

## Libraries Used
- `re`: Regular expression operations
- `string`: String constants such as `string.punctuation`, used here for punctuation removal
- `nltk.corpus.stopwords`: Stopword filtering
- `nltk.stem.PorterStemmer`: Stemming
- `nltk.tokenize.TweetTokenizer`: Tokenization
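
Because punctuation removal in this notebook relies on `string.punctuation`, it helps to know what that constant covers. The short sketch below prints it and shows that it is limited to ASCII characters; the `sample` string is a made-up illustration, not data from the notebook.

```python
import string

# string.punctuation is a fixed string of 32 ASCII punctuation characters.
print(string.punctuation)

# Typographic characters common in social media text (curly quotes,
# the ellipsis character) are not part of it, so they survive removal.
sample = "“Great day”… isn’t it?"
stripped = sample.translate(str.maketrans('', '', string.punctuation))
print(stripped)  # only the ASCII '?' is removed
```

If such characters matter for your data, they would need to be handled separately, for example by adding them to the removal table.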

## Example Usage
The notebook includes an example that demonstrates:
- Finding words that match a pattern using `re`
- Removing punctuation using `string`
- Filtering out stop words using `nltk.corpus.stopwords`
- Stemming words using `nltk.stem.PorterStemmer`
- Tokenizing text using `nltk.tokenize.TweetTokenizer`

```python
# Importing necessary libraries
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Note: the stopword list must be downloaded once with nltk.download('stopwords')

# Example text
text = "I am reading a tweet! which is making me happy, and extra tired! #example :)"

# Using re library to find all words starting with 'ex'
words_with_ex = re.findall(r'\bex\w+', text)
print(words_with_ex)  # Output: ['extra', 'example']

# Using string library to remove punctuation
# (this also strips the '#' from '#example' and removes the ':)' emoticon)
text_without_punctuation = text.translate(str.maketrans('', '', string.punctuation))
print(text_without_punctuation)
# Output: I am reading a tweet which is making me happy and extra tired example
# (followed by a trailing space left where ':)' was removed)

# Using stopwords to filter out common words
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in text_without_punctuation.split() if word.lower() not in stop_words]
print(filtered_words)  # Output: ['reading', 'tweet', 'making', 'happy', 'extra', 'tired', 'example']

# Using PorterStemmer to stem words
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print(stemmed_words)  # Output: ['read', 'tweet', 'make', 'happi', 'extra', 'tire', 'exampl']

# Using TweetTokenizer to tokenize the original text (hashtags and emoticons stay intact)
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['I', 'am', 'reading', 'a', 'tweet', '!', 'which', 'is', 'making', 'me', 'happy', ',', 'and', 'extra', 'tired', '!', '#example', ':)']
```


Output:

```
['extra', 'example']
I am reading a tweet which is making me happy and extra tired example 
['reading', 'tweet', 'making', 'happy', 'extra', 'tired', 'example']
['read', 'tweet', 'make', 'happi', 'extra', 'tire', 'exampl']
['I', 'am', 'reading', 'a', 'tweet', '!', 'which', 'is', 'making', 'me', 'happy', ',', 'and', 'extra', 'tired', '!', '#example', ':)']
```
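
As the output shows, removing every character in `string.punctuation` also strips the `#` from `#example` and drops the `:)` emoticon. If hashtags or mentions should survive punctuation removal, one option is to leave those characters out of the removal table. The following is a minimal sketch using only the standard library; the function name `strip_punctuation_keep` and its `keep` parameter are hypothetical and not part of the notebook.

```python
import string


def strip_punctuation_keep(text, keep="#@"):
    """Remove ASCII punctuation from text, except the characters in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_remove))


text = "I am reading a tweet! which is making me happy, and extra tired! #example :)"
cleaned_tokens = strip_punctuation_keep(text).split()
print(cleaned_tokens)
# ['I', 'am', 'reading', 'a', 'tweet', 'which', 'is', 'making', 'me',
#  'happy', 'and', 'extra', 'tired', '#example']
```

The emoticon is still lost with this approach. When emoticons matter, tokenizing with `TweetTokenizer` first, which keeps both `#example` and `:)` as single tokens, is likely the better starting point.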
