# Module 2 – In-Class Exercise
Working with Text in Python (Student Version)


### Learning Objectives
- Use Python string methods for text manipulation
- Handle text encoding issues
- Normalize text for preprocessing
- Prepare raw text for analysis

Complete all tasks below. Show your code and output for each section.


## Part 1 – Basic String Operations

### Dataset


```python
text = "  Python is AMAZING!! Python is powerful, flexible, and FUN.   "
text
```

### Tasks

1. Remove leading and trailing whitespace.
2. Convert the entire string to lowercase.
3. Replace "AMAZING!!" with "amazing".
4. Count how many times the word "python" appears.
5. Split the sentence into individual words.
6. Remove punctuation from the string.

Write your code below.

### Removing Punctuation in Python with `str.translate()` and `str.maketrans()`

In text preprocessing, it's common to remove punctuation from strings so that words can be analyzed cleanly. Python provides a simple way to do this using `str.translate()` and `str.maketrans()`.

#### Example:

```python
import string

text = "Hello, world! Python is fun."
# Remove punctuation
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)
```

#### How It Works

**`string.punctuation`**

A predefined string containing the ASCII punctuation characters:

`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

These are the characters we want to remove from our text. Non-ASCII punctuation, such as curly quotes or long dashes, is not included.

**`str.maketrans('', '', string.punctuation)`**

`str.maketrans(x, y, z)` creates a translation table:

- `x` → characters to replace
- `y` → characters to replace them with (must be the same length as `x`)
- `z` → characters to delete

Here, `x=''` and `y=''`, so no characters are replaced, and `z=string.punctuation`, so every punctuation character is deleted.

**`text.translate(...)`**

Applies the translation table to the string. It removes all characters listed in `z` (punctuation) and keeps letters, numbers, and spaces intact.

#### Why is this useful in text mining?

- Ensures that punctuation does not interfere with tokenization or word counts.
- Standardizes text for analysis, so that later steps such as stemming or vectorization treat `"fun."` and `"fun"` as the same token.
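
To see all three arguments of `str.maketrans()` working together, here is a minimal sketch. The sample string and the choice to map underscores and hyphens to spaces are assumptions made for illustration; they are not part of the exercise dataset.

```python
# Hypothetical sample string, not part of the exercise dataset.
sample = "snake_case and kebab-case, cleaned!"

# x and y must have the same length: '_' -> ' ' and '-' -> ' '.
# z lists the characters to delete outright.
table = str.maketrans("_-", "  ", ",!")

cleaned = sample.translate(table)
print(cleaned)  # snake case and kebab case cleaned
```

Replacement and deletion happen in a single pass over the string, which is why this approach is usually preferred over chaining several `replace()` calls.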

```python
# Your code here (Student Work Area)
```


## Part 2 – Advanced String Manipulation

### Dataset


```python
sentence = "Data Science, data science, DATA science!"
sentence
```

### Tasks

1. Normalize the text so all variations become identical.
2. Remove duplicate words.
3. Output the cleaned sentence as:

data science

Optional Challenge:
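- Use a set to track the words you have already seen, so that duplicates are removed while the original word order is preserved (a set on its own does not keep order).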


# Text Normalization – In-Class Demonstration

## What is Text Normalization?

Text normalization is the process of **transforming text into a standard, consistent format** so it can be analyzed more easily.

Key points:

- Words can appear in many forms (e.g., `"Data"`, `"data"`, `"DATA"`). Normalization treats them as the same.
- Text may include extra spaces, punctuation, emojis, or repeated characters.
- Normalization prepares raw text for **tokenization, word counting, machine learning, or NLP tasks**.

**Examples:**

| Original Text                  | Normalized Text               |
|--------------------------------|-------------------------------|
| `"Data Science, data science"` | `"data science data science"` |
| `"I LOVED this movie!!!"`      | `"i loved this movie"`        |
| `"Café Münster"`               | `"café münster"`              |

---

## Why Normalization Matters

- **Consistency:** Avoid treating `"Python"` and `"python"` as different words.
- **Reduces noise:** Punctuation, extra spaces, or repeated characters do not interfere with analysis.
- **Reliable feature extraction:** Vectorization, word counts, and sentiment analysis work better on clean text.
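
The consistency point can be checked with a quick word count. This is only a sketch: the sample sentence is made up for illustration, and `collections.Counter` is one convenient way to count tokens, not a required tool for this exercise.

```python
from collections import Counter

# Hypothetical sample sentence used only for illustration.
sample = "Python is fun. python is powerful. PYTHON is everywhere."

raw_counts = Counter(sample.split())
normalized_counts = Counter(sample.lower().split())

# Without normalization, each casing is counted as a separate word.
print(raw_counts["Python"], raw_counts["python"], raw_counts["PYTHON"])  # 1 1 1

# After lowercasing, all three variants are counted together.
print(normalized_counts["python"])  # 3
```

Note that `"fun."` and `"powerful."` still carry their trailing periods here, which is the kind of noise that punctuation removal addresses.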

---

## Step-by-Step Demonstration in Python

### Step 1: Start with messy text
```python
raw_text = "I LOVED this movie!!! It was soooo good!!! 😄😄"
print(raw_text)
```
*Output:*  
```
I LOVED this movie!!! It was soooo good!!! 😄😄
```

### Step 2: Convert to lowercase
```python
text_lower = raw_text.lower()
print(text_lower)
```
*Output:*  
```
i loved this movie!!! it was soooo good!!! 😄😄
```
- All words are now lowercase.

### Step 3: Remove punctuation
```python
import string
text_no_punct = text_lower.translate(str.maketrans('', '', string.punctuation))
print(text_no_punct)
```
*Output:*  
```
i loved this movie it was soooo good 😄😄
```
- Punctuation like `!!!` is removed.

### Step 4: Reduce repeated characters
```python
import re
text_normalized = re.sub(r'(.)\1{2,}', r'\1\1', text_no_punct)
print(text_normalized)
```
*Output:*  
```
i loved this movie it was soo good 😄😄
```
- Runs of three or more identical characters are shortened to two, so `"soooo"` becomes `"soo"`.
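
How far to collapse a run is a design choice. The sketch below compares three variants on a fragment of the demonstration text; the variable names are illustrative only.

```python
import re

sample = "it was soooo good"  # fragment of the demonstration text

# Runs of 3+ identical characters collapsed to two (the pattern used above).
keep_two = re.sub(r"(.)\1{2,}", r"\1\1", sample)

# The same runs collapsed to a single character.
keep_one = re.sub(r"(.)\1{2,}", r"\1", sample)

# Any run of 2+ collapsed to one: this also damages legitimate double letters.
too_aggressive = re.sub(r"(.)\1+", r"\1", sample)

print(keep_two)        # it was soo good
print(keep_one)        # it was so good
print(too_aggressive)  # it was so god
```

Requiring at least three repeats (`{2,}` after the first character) leaves ordinary double letters such as the `oo` in `good` untouched.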

### Step 5: Remove emojis (optional)
```python
text_clean = text_normalized.encode('ascii', 'ignore').decode('ascii')
print(text_clean)
```
*Output:*  
```
i loved this movie it was soo good
```
- Non-ASCII characters such as emojis are removed. Note that this basic method also strips accented letters (for example, the `é` in `café`).
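
If accented letters need to survive, one alternative is to target emoji code points with a regular expression. The sketch below is an approximation: the two ranges are an assumption that covers many common emoji but not all of them, and `EMOJI_PATTERN` is a name chosen here for illustration.

```python
import re

# Assumption: these ranges cover many common emoji, but not every one.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

sample = "café was soo good 😄😄"
without_emoji = EMOJI_PATTERN.sub("", sample).strip()
print(without_emoji)  # café was soo good
```

Unlike the ASCII round trip, this keeps `é` intact, at the cost of having to maintain the list of ranges.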

---

## Practice for Students

- **Questions:**
  - "What would happen if we skip lowercase conversion?"  
  - "How might repeated characters affect sentiment analysis?"  
- **Hands-On:** Normalize your own example sentences.  
- **Reflection:** Consider what information might be lost during normalization (emphasis, emojis, special punctuation).

---

## Quick Summary

**Normalization Steps in Practice:**

1. Convert text to lowercase.
2. Remove punctuation.
3. Remove extra whitespace.
4. Reduce repeated characters.
5. Remove or standardize non-standard characters (e.g., emojis, accented letters).

Note: Clean, normalized text is ready for NLP analysis, feature extraction, and machine learning workflows.
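
The summary steps can be combined into one helper. This is a minimal sketch rather than a reference solution: `normalize_text` is a hypothetical name, the input adds extra spaces to the demonstration sentence, and whitespace cleanup is moved to the end so that gaps left by removed characters are also collapsed.

```python
import re
import string


def normalize_text(text):
    """Apply the five summary steps, with whitespace cleanup done last."""
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. punctuation
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                        # 4. repeated characters
    text = text.encode("ascii", "ignore").decode("ascii")             # 5. non-ASCII characters
    text = " ".join(text.split())                                     # 3. extra whitespace
    return text


result = normalize_text("  I LOVED this movie!!!   It was soooo good!!! 😄😄 ")
print(result)  # i loved this movie it was soo good
```

`" ".join(text.split())` removes leading and trailing whitespace and reduces every internal run of whitespace to a single space.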

```python
# Your code here (Student Work Area)
```


## Part 3 – Text Encoding

### Dataset


```python
text = "Café Münster — naïve façade"
text
```

### Tasks

1. Print the text normally.
2. Encode the string in UTF-8.
3. Decode it back to readable format.
4. Attempt ASCII encoding. What happens?

# Text Encoding and Decoding – In-Class Demonstration

## What is Text Encoding?

Text encoding is the process of converting characters into bytes so that computers can store or transmit text.

- **UTF-8:** Can represent every Unicode character, covering virtually all written languages, using one to four bytes per character.
- **ASCII:** Represents only 128 characters (English letters, digits, basic punctuation, and control codes).
- Encoding ensures text can be stored or sent, while decoding converts it back to readable form.

---

## Step-by-Step Example

### Sample Text
```python
text = "Café Münster — naïve façade"
print(text)
```
*Output:*  
```
Café Münster — naïve façade
```

---

### Step 1: Encode in UTF-8
```python
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```
*Output (bytes):*  
```
b'Caf\xc3\xa9 M\xc3\xbcnster \xe2\x80\x94 na\xc3\xafve fa\xc3\xa7ade'
```
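- UTF-8 converts each character into one to four bytes; plain ASCII characters stay as a single byte.  
- Non-ASCII characters like `é`, `ü`, `—`, `ï`, `ç` are represented as multiple bytes (two for each accented letter here, three for the dash).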
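
To make the byte counts concrete, the following sketch encodes a few characters individually and compares the character count of the sample text with its byte count. The variable names are illustrative.

```python
text = "Café Münster — naïve façade"

# Encode one ASCII letter and each non-ASCII character on its own.
for ch in "aéü—ïç":
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), encoded)

char_count = len(text)                  # number of characters
byte_count = len(text.encode("utf-8"))  # number of bytes after encoding

print(char_count, "characters")  # 27 characters
print(byte_count, "bytes")       # 33 bytes
```

The six extra bytes come from the four accented letters (one extra byte each) and the dash (two extra bytes).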

---

### Step 2: Decode back to readable text
```python
decoded_text = utf8_bytes.decode('utf-8')
print(decoded_text)
```
*Output:*  
```
Café Münster — naïve façade
```
- Decoding converts the bytes back into human-readable text.

---

### Step 3: Attempt ASCII encoding
```python
ascii_bytes = text.encode('ascii')
```
- **Result:** Python raises a `UnicodeEncodeError` because ASCII cannot represent characters like `é`, `ü`, `—`, `ï`, `ç`.  

**Explanation:**  
- ASCII supports only basic English letters, digits, and some punctuation.  
- Non-English or special characters cannot be encoded in ASCII without data loss.  
- You could handle this with options like `errors='ignore'` or `errors='replace'` if needed.
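
As a sketch of these options, the error can be caught and inspected, and `errors='replace'` can be used when a visible placeholder is preferable to silently dropping characters. The variable names `failed_index` and `replaced` are chosen here for illustration.

```python
text = "Café Münster — naïve façade"

try:
    text.encode("ascii")
    failed_index = None
except UnicodeEncodeError as err:
    # err.start is the index of the first character that could not be encoded.
    failed_index = err.start
    print("Failed at index", failed_index, "on", repr(text[failed_index]))

# errors='replace' substitutes '?' for each character ASCII cannot represent.
replaced = text.encode("ascii", errors="replace")
print(replaced)  # b'Caf? M?nster ? na?ve fa?ade'
```

Both options still lose information; they only differ in whether the loss is visible in the output.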

---

## Key Takeaways

1. **Encoding** converts text to bytes for storage or transmission.  
2. **UTF-8** can represent any Unicode character; **ASCII** is limited to 128 characters.  
3. **Decoding** is necessary to convert bytes back to readable text.  
4. Use UTF-8 for text mining and NLP to avoid errors with special characters.

---

## Optional Demo: ASCII with error handling
```python
ascii_bytes_safe = text.encode('ascii', errors='ignore')
ascii_text_safe = ascii_bytes_safe.decode('ascii')
print(ascii_text_safe)
```
*Output:*  
```
Caf Mnster  nave faade
```
- Non-ASCII characters are removed, which may **cause data loss**.
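
A less lossy alternative, sketched below, is to decompose accented letters with the standard `unicodedata` module before dropping non-ASCII characters. Whether folding `é` to `e` is acceptable depends on the language and the task, so treat this as one option rather than a recommended default.

```python
import unicodedata

text = "Café Münster — naïve façade"

# NFKD splits each accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFKD", text)

# Dropping non-ASCII now removes only the marks (and the dash), not the letters.
folded = decomposed.encode("ascii", errors="ignore").decode("ascii")
print(folded)  # Cafe Munster  naive facade
```

The dash has no ASCII equivalent, so it is still removed and leaves a double space behind.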

```python
# Your code here (Student Work Area)
```


## Part 4 – Text Normalization

### Dataset


```python
raw_text = "I LOVED this movie!!! It was soooo good!!! 😄😄"
raw_text
```

### Tasks

1. Convert text to lowercase.
2. Remove repeated exclamation marks.
3. Remove emojis (basic method).
4. Reduce repeated characters (e.g., "soooo" → "so").
5. Output a cleaned version suitable for analysis.

Bonus:
- Use regular expressions (re) to reduce repeated characters.

```python
# Your code here (Student Work Area)
```


## Reflection Questions

1. Why is normalization important before text analysis?
2. What information might be lost during cleaning?
3. When might you not want to normalize aggressively?

Write 3–5 sentences below.

# Your response here.


## Submission Instructions

- Ensure all cells run without errors.
- Display outputs for each section.
- Include written reflection answers.
- Submit the completed notebook (.ipynb) to Canvas.