# Part 1

In [3]:
import pandas as pd
# Load the airline lookup table from the CSV file
# The file has two columns: 'IATA_CODE' and 'AIRLINE'
df = pd.read_csv('airlines.csv')
print(df.head(50))


   IATA_CODE                       AIRLINE
0         UA         United Air Lines Inc.
1         AA        American Airlines Inc.
2         US               US Airways Inc.
3         F9        Frontier Airlines Inc.
4         B6               JetBlue Airways
5         OO         Skywest Airlines Inc.
6         AS          Alaska Airlines Inc.
7         NK              Spirit Air Lines
8         WN        Southwest Airlines Co.
9         DL          Delta Air Lines Inc.
10        EV   Atlantic Southeast Airlines
11        HA        Hawaiian Airlines Inc.
12        MQ  American Eagle Airlines Inc.
13        VX                Virgin America


# Part 2

In [4]:
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Download necessary NLTK data
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Load the raw text of "Alice's Adventures in Wonderland" by Lewis Carroll
text = gutenberg.raw('carroll-alice.txt')


[nltk_data] Downloading package gutenberg to C:\Users\Chief
[nltk_data]     Oggy\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Chief
[nltk_data]     Oggy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Chief
[nltk_data]     Oggy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Chief
[nltk_data]     Oggy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Sentence Tokenization

In [5]:
# Sentence Tokenization
sentences = sent_tokenize(text)
print("Number of sentences:", len(sentences))
print("First 5 sentences:", sentences[:5])

Number of sentences: 1625
First 5 sentences: ["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.", "Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'", 'So she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.', "There was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!", 'Oh dear!']
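
The sentences returned above still carry the hard line breaks of the source file (`\n`). The following is a minimal sketch of one way to normalize them before display or further processing, assuming that collapsing every run of whitespace into a single space is acceptable. The helper name `normalize_whitespace` is hypothetical, and the sample string is a fragment of the third sentence in the output above.

```python
def normalize_whitespace(sentence: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(sentence.split())


# Fragment of the third sentence from the sent_tokenize output above
sample = (
    "So she was considering in her own mind (as well as she could, for the\n"
    "hot day made her feel very sleepy and stupid)"
)

cleaned = normalize_whitespace(sample)
print(cleaned)
```

In the notebook, the same helper could be applied to the whole list with `[normalize_whitespace(s) for s in sentences]`.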


## Word Tokenization

In [6]:
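# Word Tokenization
# Note: word_tokenize also returns punctuation marks as tokens,
# so the count below is a token count rather than a pure word count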
words = word_tokenize(text)
print("Number of words:", len(words))
print("First 20 words:", words[:20])

Number of words: 33494
First 20 words: ['[', 'Alice', "'s", 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning']


## Stemming

In [7]:
# Initialize the PorterStemmer
ps = PorterStemmer()

# Apply stemming to the words (PorterStemmer lowercases each token by default)
stemmed_words = [ps.stem(word) for word in words]
print("First 20 stemmed words:", stemmed_words[:20])


First 20 stemmed words: ['[', 'alic', "'s", 'adventur', 'in', 'wonderland', 'by', 'lewi', 'carrol', '1865', ']', 'chapter', 'i', '.', 'down', 'the', 'rabbit-hol', 'alic', 'wa', 'begin']


## Lemmatization

In [8]:
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

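# Apply lemmatization to the words
# Without a pos argument every token is treated as a noun,
# which is why 'was' comes out as 'wa' in the output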
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("First 20 lemmatized words:", lemmatized_words[:20])


First 20 lemmatized words: ['[', 'Alice', "'s", 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'wa', 'beginning']


# Discussion

**Impact of Tokenization:**

1. Sentence tokenization splits the raw text into 1,625 sentences, which gives manageable units for further processing.
2. Word tokenization yields 33,494 tokens (punctuation marks included) and is a prerequisite for frequency analysis, stemming, and lemmatization.
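
As an illustration of the frequency analysis mentioned above, here is a small sketch that drops punctuation-like tokens and counts the rest. It uses the first 20 tokens printed in the Word Tokenization output. The helper `is_word` and its rule (keep tokens that start with a letter) are assumptions made for this example, not part of NLTK.

```python
from collections import Counter

# First 20 tokens from the word_tokenize output above
tokens = ['[', 'Alice', "'s", 'Adventures', 'in', 'Wonderland', 'by', 'Lewis',
          'Carroll', '1865', ']', 'CHAPTER', 'I', '.', 'Down', 'the',
          'Rabbit-Hole', 'Alice', 'was', 'beginning']


def is_word(token: str) -> bool:
    """Keep tokens that start with a letter (drops brackets, '.', "'s", '1865')."""
    return token[:1].isalpha()


# Lowercase so that 'Alice' and 'alice' would be counted together
word_tokens = [t.lower() for t in tokens if is_word(t)]
freq = Counter(word_tokens)

print("Tokens:", len(tokens), "Word tokens:", len(word_tokens))
print("Most common:", freq.most_common(1))
```

Applied to the full `words` list, the same pattern would give corpus-wide counts.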

**Impact of Stemming:**
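1. Stemming reduces vocabulary size by collapsing inflected forms of a word onto a common stem.
2. The Porter stemmer is aggressive: it lowercases tokens and often produces non-dictionary strings (here 'alic', 'adventur', and 'wa'), so some meaning and all case information are lost.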
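
The vocabulary reduction described above can be sketched with a handful of related word forms. The word list is illustrative and not taken from the corpus; the only library call used is `PorterStemmer.stem`, as in the notebook.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative word list (an assumption for this example, not from the corpus)
words = ["connect", "connected", "connecting", "connection", "connections"]

stems = [ps.stem(w) for w in words]

# Compare the number of distinct surface forms with the number of distinct stems
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stems")
print(set(stems))
```

Running the same comparison on the notebook's data, `len(set(words))` against `len(set(stemmed_words))`, would show how much the corpus vocabulary shrinks.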

**Impact of Lemmatization:**
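1. Lemmatization maps each word to a dictionary base form (lemma), so its output stays readable, unlike stems such as 'alic'.
2. `WordNetLemmatizer` does not infer context by itself: `lemmatize(word)` treats every token as a noun unless a `pos` argument is given. That is why most tokens in the output above are unchanged and 'was' becomes the non-word 'wa'.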

**Impact of Stop Word Removal:**
1. Removing stop words such as 'the', 'of', and 'was' reduces noise in the text and leaves the content-bearing words.
2. It shortens the token list, which can improve both the efficiency and the effectiveness of tasks such as text classification and topic modeling.
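
A minimal sketch of stop word removal on tokens from the opening sentence of the corpus. The helper `remove_stop_words` and the small hand-written `demo_stop_words` set are assumptions made so the example runs without downloads; in the notebook, `set(stopwords.words('english'))` from the already imported `nltk.corpus.stopwords` could be passed instead.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hand-written stand-in for a real stop word list (hypothetical, for illustration)
demo_stop_words = {'the', 'of', 'to', 'by', 'her', 'was', 'very', 'on', 'and'}

# Tokens from the opening sentence of the corpus
tokens = ['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of',
          'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank']

filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

Comparing case-insensitively matters here because typical stop word lists are lowercase while the tokens keep their original case.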

**Each preprocessing step has a distinct impact on the text corpus:**

1. **Tokenization:** Breaks text into smaller units (sentences and words), making it easier to process and analyze.
2. **Stemming:** Reduces words to stems, which normalizes the text but sometimes creates non-words.
3. **Lemmatization:** Maps words to dictionary base forms; it improves on stemming when the part of speech is supplied.
4. **Stop Word Removal:** Eliminates common, low-information words, sharpening the focus on significant words.