# Text Preprocessing Assignments

## 0. Create a New Environment

Run the following commands in the Terminal (Mac) or Anaconda Prompt (PC):

#### 1. view, create and switch environments
```
conda env list
conda create --name nlp_basics
conda env list
conda activate nlp_basics
```

#### 2. install and view packages
```
conda install python jupyter notebook pandas matplotlib scikit-learn spacy
conda list
```

#### 3. additional spaCy download
```
python -m spacy download en_core_web_sm
```
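
Once the installs finish, a quick check from Python can confirm that the environment is usable. The snippet below is a minimal sketch: the helper name `missing_packages` is illustrative, and the list assumes the import names that correspond to the packages installed above (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages installed above (assumed mapping:
# scikit-learn -> sklearn; the spaCy model installs as en_core_web_sm).
REQUIRED = ["pandas", "matplotlib", "sklearn", "spacy", "notebook", "en_core_web_sm"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

If anything is reported as missing, make sure `nlp_basics` is the active environment before re-running the install commands.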

## 1. Text Preprocessing with Pandas

1. Read the _childrens_books.csv_ file into a Jupyter Notebook
2. Within the Description column:
* Make all the text lowercase
* Remove all \xa0 characters
* Remove all punctuation

In [1]:
from dotenv import load_dotenv, find_dotenv
from pathlib import Path
import os

# 1) Load the .env file (searching upward from the notebook's folder)
load_dotenv(find_dotenv(usecwd=True))

# 2) Read the value, strip surrounding quotes, and expand ~
raw = os.getenv("FILE_PATH")
if not raw:
    raise RuntimeError("FILE_PATH variable not found in .env")
base_dir = Path(raw.strip().strip('"\'')).expanduser().resolve()

print("BASE DIR =", base_dir)
print("Exists? ->", base_dir.exists())

# 3) Join paths with pathlib (no need to worry about a trailing / on the path)
some_file = base_dir / "dataset.csv"    # example file
print(some_file)

BASE DIR = /Users/akanitkwangkaew/Documents/Data-Projects/nlp/On_the_Git
Exists? -> True
/Users/akanitkwangkaew/Documents/Data-Projects/nlp/On_the_Git/dataset.csv


In [2]:
import pandas as pd

data_file = base_dir / "Data/childrens_books.csv"
df = pd.read_csv(data_file)
df

| Unnamed: 0 | Ranking | Title | Author | Year | Rating | Description |
|---|---|---|---|---|---|---|
| 0 | 1 | Where the Wild Things Are | Maurice Sendak | 1963 | 4.25 | Where the Wild Things Are follows Max, a young... |
| 1 | 2 | The Very Hungry Caterpillar | Eric Carle | 1969 | 4.34 | The Very Hungry Caterpillar tells the story of... |
| 2 | 3 | The Giving Tree | Shel Silverstein | 1964 | 4.38 | The Giving Tree is a touching and bittersweet ... |
| 3 | 4 | Green Eggs and Ham | Dr. Seuss | 1960 | 4.31 | In Green Eggs and Ham, Sam-I-Am tries to convi... |
| 4 | 5 | Goodnight Moon | Margaret Wise Brown | 1947 | 4.31 | Goodnight Moon is a gentle, rhythmic bedtime s... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | Stone Soup | Jon J. Muth | 2003 | 4.18 | Stone Soup is a classic folktale retold by Jon... |
| 96 | 97 | A Light in the Attic | Shel Silverstein | 1981 | 4.36 | A Light in the Attic is a collection of quirky... |
| 97 | 98 | Harry Potter and the Prisoner of Azkaban (Harr... | J.K. Rowling | 1999 | 4.58 | Harry Potter and the Prisoner of Azkaban is th... |
| 98 | 99 | Harry Potter and the Chamber of Secrets (Harry... | J.K. Rowling | 1998 | 4.43 | Harry Potter and the Chamber of Secrets is the... |


* Make all the text lowercase
* Remove all \xa0 characters
* Remove all punctuation

In [4]:
df['Description_clean'] = df['Description'].str.lower()
df['Description_clean'].head()

0    where the wild things are follows max, a young...
1    the very hungry caterpillar tells the story of...
2    the giving tree is a touching and bittersweet ...
3    in green eggs and ham, sam-i-am tries to convi...
4    goodnight moon is a gentle, rhythmic bedtime s...
Name: Description_clean, dtype: object

In [8]:
# remove text between brackets, including the brackets
df['Description_clean'] = df['Description_clean'].str.replace(r'\[.*?\]', '', regex=True)
df['Description_clean']

0     where the wild things are follows max, a young...
1     the very hungry caterpillar tells the story of...
2     the giving tree is a touching and bittersweet ...
3     in green eggs and ham, sam-i-am tries to convi...
4     goodnight moon is a gentle, rhythmic bedtime s...
                            ...                        
95    stone soup is a classic folktale retold by jon...
96    a light in the attic is a collection of quirky...
97    harry potter and the prisoner of azkaban is th...
98    harry potter and the chamber of secrets is the...
99    the three billy goats gruff is a retelling of ...
Name: Description_clean, Length: 100, dtype: object

In [9]:
# remove punctuation
df['Description_clean'] = df['Description_clean'].str.replace(r'[^\w\s]', '', regex=True)
df['Description_clean']

0     where the wild things are follows max a young ...
1     the very hungry caterpillar tells the story of...
2     the giving tree is a touching and bittersweet ...
3     in green eggs and ham samiam tries to convince...
4     goodnight moon is a gentle rhythmic bedtime st...
                            ...                        
95    stone soup is a classic folktale retold by jon...
96    a light in the attic is a collection of quirky...
97    harry potter and the prisoner of azkaban is th...
98    harry potter and the chamber of secrets is the...
99    the three billy goats gruff is a retelling of ...
Name: Description_clean, Length: 100, dtype: object

In [11]:
# replace non-breaking spaces (\xa0) with regular spaces
df['Description_clean'] = df['Description_clean'].str.replace('\xa0', ' ', regex=False)
df.head(2)

| Unnamed: 0 | Ranking | Title | Author | Year | Rating | Description | Description_clean |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Where the Wild Things Are | Maurice Sendak | 1963 | 4.25 | Where the Wild Things Are follows Max, a young... | where the wild things are follows max a young ... |
| 1 | 2 | The Very Hungry Caterpillar | Eric Carle | 1969 | 4.34 | The Very Hungry Caterpillar tells the story of... | the very hungry caterpillar tells the story of... |
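
The individual cleaning steps above can also be chained into a single reusable function. The following is a sketch rather than the assignment's required solution: the name `clean_descriptions` and the sample string are made up for illustration, and the final whitespace-collapsing step is an optional extra that tidies the double spaces left behind when brackets or punctuation are removed.

```python
import pandas as pd


def clean_descriptions(text: pd.Series) -> pd.Series:
    """Lowercase, replace non-breaking spaces, drop bracketed text and punctuation."""
    return (
        text.str.lower()
        .str.replace("\xa0", " ", regex=False)      # non-breaking space -> regular space
        .str.replace(r"\[.*?\]", "", regex=True)    # text between brackets, brackets included
        .str.replace(r"[^\w\s]", "", regex=True)    # anything that is not a word char or whitespace
        .str.replace(r"\s+", " ", regex=True)       # optional extra: collapse repeated whitespace
        .str.strip()
    )


# Hypothetical sample row, not taken from childrens_books.csv
sample = pd.Series(["Sam-I-Am\xa0tries [1] Green Eggs, and Ham!"])
print(clean_descriptions(sample)[0])  # -> samiam tries green eggs and ham
```

With the notebook's DataFrame, the equivalent call would be `df['Description_clean'] = clean_descriptions(df['Description'])`.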


## 2. Text Preprocessing with spaCy

Building on the lowercasing and special-character removal from the previous assignment, apply the following steps to the cleaned Description column:
* Tokenize the text
* Lemmatize the text
* Remove stop words
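
One way to approach these three steps is sketched below, assuming spaCy and the `en_core_web_sm` model from section 0 are installed. The helper names `lemmatize_doc` and `normalize_column` and the output column `Description_norm` are illustrative choices, not part of the assignment, and disabling the parser and NER components is an assumption made only to speed up processing.

```python
import pandas as pd


def lemmatize_doc(doc):
    """Return the lemmas of a parsed document, skipping stop words and whitespace tokens."""
    return [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space]


def normalize_column(texts: pd.Series, model: str = "en_core_web_sm") -> pd.Series:
    """Tokenize, lemmatize, and remove stop words for every string in `texts`."""
    import spacy  # imported here so lemmatize_doc can be reused without loading spaCy

    # Assumption: the parser and NER are not needed for lemmas or stop-word flags.
    nlp = spacy.load(model, disable=["parser", "ner"])
    docs = nlp.pipe(texts.astype(str))  # tokenization happens when each Doc is created
    return pd.Series([" ".join(lemmatize_doc(doc)) for doc in docs], index=texts.index)


# Example call, assuming `df` from the previous assignment:
# df["Description_norm"] = normalize_column(df["Description_clean"])
```

Joining the lemmas back into a single string keeps the column in a form that the vectorizers in the next assignments can consume directly.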

## 3. Count Vectorizer

1. Vectorize the cleaned and normalized text using Count Vectorizer with the default parameters
2. Modify the Count Vectorizer parameters to reduce the number of columns:
* Remove stop words
* Set a minimum document frequency of 10%
3. Use the updated Count Vectorizer to identify the:
* Top 10 most common terms
* Top 10 least common terms that appear in at least 10% of the documents
4. Create a horizontal bar chart of the top 10 most common terms

## 4. TF-IDF Vectorizer

1. Vectorize the cleaned and normalized text using TF-IDF Vectorizer with the default parameters
2. Modify the TF-IDF Vectorizer parameters to reduce the number of columns:
* Remove stop words
* Set a minimum document frequency of 10%
* Set a maximum document frequency of 50%
3. Using the updated TF-IDF Vectorizer, create a horizontal bar chart of the top 10 most highly weighted terms
4. Compare the Count Vectorizer bar chart from the previous assignment with the TF-IDF Vectorizer bar chart and note the differences in the top term lists