# Text Classification with NLTK and Scikit-Learn
In this notebook, we will install the necessary libraries and then:
1. Load and inspect the dataset.
2. Preprocess the data.
3. Convert text data to numerical data.
4. Split the data into training and testing sets.
5. Build and train a classification model.
6. Evaluate the model's performance.

In [None]:
# Install necessary libraries
!pip install nltk scikit-learn pandas numpy

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Download NLTK data
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

## 1. Load and inspect the dataset
We'll use the SMS Spam Collection, a tab-separated file with no header row in which each line holds a label (ham or spam) and the text of one message.

In [None]:
# Load the dataset
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Display the first 5 rows of the dataframe
df.head()
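
Beyond `df.head()`, it usually helps to check the class balance and look for missing values before modelling. The sketch below does this on a tiny hand-made frame with the same `label` and `message` columns; the rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the two-column layout used above
sample = pd.DataFrame({
    'label': ['ham', 'spam', 'ham', 'ham'],
    'message': ['See you at five', 'WIN a free prize now', 'Running late', 'Call me later'],
})

# Share of each class; a skewed split means accuracy alone can be misleading
label_share = sample['label'].value_counts(normalize=True)

# Number of missing values per column
missing = sample.isna().sum()

print(label_share)
print(missing)
```

The same two calls can be run on `df` directly once it has been loaded.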

## 2. Preprocess the data
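We'll preprocess the data by converting each message to lowercase and removing English stop words from NLTK's list. Messages are split on whitespace, so punctuation stays attached to the words around it.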

In [None]:
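# Lowercase each message and drop English stop words (tokens are split on whitespace)
def remove_stop_words(text):
    return ' '.join(word for word in text.lower().split() if word not in stop_words)

df['message'] = df['message'].apply(remove_stop_words)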

# Display the first 5 rows of the preprocessed dataframe
df.head()
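
Because the preprocessing step splits on whitespace, a token such as "the," keeps its trailing comma and no longer matches the stop-word list. The following is a minimal sketch of one alternative, assuming a regular-expression tokenizer is acceptable for this data. `clean_text`, `whitespace_clean`, and the small `demo_stop_words` set are hypothetical names; the set stands in for NLTK's full English list so the example runs without a download.

```python
import re

# Stand-in for stopwords.words('english'); an assumption for illustration only
demo_stop_words = {'the', 'is', 'a', 'to', 'you'}

def clean_text(text, stop_words):
    """Lowercase, keep runs of letters, digits and apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return ' '.join(token for token in tokens if token not in stop_words)

def whitespace_clean(text, stop_words):
    """Same filtering as the notebook cell: split on whitespace only."""
    return ' '.join(w.lower() for w in text.split() if w.lower() not in stop_words)

message = 'Is the, prize yours? Reply to claim!'
print(whitespace_clean(message, demo_stop_words))
print(clean_text(message, demo_stop_words))
```

The first line of output still contains "the," while the second does not. Whether stripping punctuation helps the classifier is an empirical question that this notebook does not test.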

## 3. Convert text data to numerical data
We'll use TfidfVectorizer to convert each message into a sparse vector of TF-IDF weights, with one column per term in the vocabulary.

In [None]:
# Convert text data to numerical data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']

# Display the shape of the features
X.shape

## 4. Split the data into training and testing sets
We'll split the data into 80% training and 20% testing sets.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape
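
When one class is much rarer than the other, a purely random split can leave the test set with a different class mix than the training set. The following is a hedged sketch of a stratified split using the `stratify` argument of `train_test_split`. The toy labels are invented, and whether stratification changes the results for this dataset is not something the notebook measures.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented toy data: 8 'ham' and 2 'spam' examples
texts = [f'message {i}' for i in range(10)]
labels = ['ham'] * 8 + ['spam'] * 2

# stratify=labels keeps the same class ratio in both halves
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

print(Counter(train_labels))
print(Counter(test_labels))
```

In the notebook's own split, the equivalent change would be passing `stratify=y`.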

## 5. Build and train a classification model
We'll use a Multinomial Naive Bayes model for this example.

In [None]:
# Build and train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Display the predictions
y_pred
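
In this notebook, `TfidfVectorizer` is fitted on every message before the split, so the vocabulary and IDF weights are computed with the test messages included. One common way to keep the test set fully unseen is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, which fits both steps on the training data only. The sketch below illustrates this on invented messages; it is an alternative arrangement, not the notebook's own workflow, and `text_clf` is a hypothetical name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training messages and labels
train_messages = [
    'win free prize now', 'free cash claim now', 'claim free prize',
    'see you at lunch', 'call me after work', 'lunch after work today',
]
train_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

# The vectorizer learns its vocabulary from train_messages only
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(train_messages, train_labels)

# Raw strings go straight in; the pipeline applies the fitted vectorizer first
new_messages = ['free prize', 'lunch today']
print(text_clf.predict(new_messages))
```

A side benefit of this arrangement is that new, unprocessed messages can be classified with a single `predict` call.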

## 6. Evaluate the model's performance
We'll evaluate the model's performance using accuracy, a confusion matrix, and a per-class report of precision, recall, and F1 score.

In [None]:
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Display the evaluation results (the report is a multi-line string, so print it)
print(f'Accuracy: {accuracy:.4f}')
print(conf_matrix)
print(class_report)
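
The rows and columns of `confusion_matrix` follow the sorted label order unless `labels` is passed, so it is easy to misread which cell holds the missed spam. The following small sketch, with invented true and predicted labels, fixes the order explicitly and reports precision and recall for the spam class via `pos_label`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration
y_true = ['ham', 'ham', 'ham', 'spam', 'spam', 'ham']
y_hat = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham']

# Rows are true classes, columns are predicted classes, both in this order
cm = confusion_matrix(y_true, y_hat, labels=['ham', 'spam'])
print(cm)

# Of the messages flagged as spam, how many were spam?
spam_precision = precision_score(y_true, y_hat, pos_label='spam')
# Of the true spam messages, how many were caught?
spam_recall = recall_score(y_true, y_hat, pos_label='spam')
print(spam_precision, spam_recall)
```

On an imbalanced dataset, these per-class figures are often more informative than overall accuracy.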