# Text Classification
The Twitter dataset (`tweets.csv`) was collected in February 2015. Contributors first classified each tweet as positive, negative, or neutral, and then categorized the reason behind each negative tweet (such as "late flight" or "rude service"). The dataset is available on [Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

You should build an end-to-end NLP pipeline to predict the sentiment class (i.e., positive, negative, or neutral) given a tweet. In particular, you should do the following.

## With Traditional Models
- Load the `tweets` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end NLP pipeline, including a text representation model, such as a [TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), and a traditional classification model, such as [Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).
- Optimize your pipeline by validating your design decisions.
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/stable/modules/model_evaluation.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

## With Deep Learning Models
- Load the `tweets` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end NLP pipeline, including a deep learning classification model, such as DistilBERT, using [Hugging Face Transformers](https://huggingface.co/docs/transformers/v4.17.0/en/tasks/sequence_classification).
- Optimize your pipeline by validating your design decisions.
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/stable/modules/model_evaluation.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

In [1]:
import pandas as pd
import sklearn.model_selection

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/m-mahdavi/teaching/main/datasets/tweets.csv')

In [3]:
df_train, df_test = sklearn.model_selection.train_test_split(df, test_size = 0.2)
print(f'DF size: {df.shape}')
print(f'DF Train size: {df_train.shape}')
print(f'DF Test size: {df_test.shape}')

DF size: (14640, 15)
DF Train size: (11712, 15)
DF Test size: (2928, 15)
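The split above is unseeded and unstratified, so it changes on every run and the class proportions can drift between the two parts. The following is a minimal sketch of a reproducible, stratified split on a toy frame. The label column name `airline_sentiment` and the seed value are assumptions, not something this notebook confirms.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the tweets data; the label column name
# `airline_sentiment` is an assumption for illustration only.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(20)],
    "airline_sentiment": ["negative"] * 12 + ["neutral"] * 4 + ["positive"] * 4,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,                     # fixed seed: same split on every run
    stratify=toy["airline_sentiment"],   # keep class proportions in both parts
)

print(toy_train["airline_sentiment"].value_counts().to_dict())
print(toy_test["airline_sentiment"].value_counts().to_dict())
```

With stratification, every sentiment class should appear in the test part even when one class dominates the data.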


## Traditional Method

### NLP Pipeline

### Step 1: Lowercasing

In [4]:
tweets_train = df_train['text'].to_list()

In [5]:
type(tweets_train)

list

In [6]:
# Lowercase every tweet in the training set
tweets_train_lower = [tweet.lower() for tweet in tweets_train]

In [7]:
type(tweets_train_lower)

list

### Step 2: Tokenization

In [8]:
import nltk

In [9]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [10]:
len(tweets_train_lower)

11712

In [11]:
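# Tokenize every tweet and keep one token list per tweet
# (a plain loop that reassigns `words` would keep only the last tweet's tokens)
tweets_train_tokens = [nltk.word_tokenize(sentence) for sentence in tweets_train_lower]

# Tokens of the last tweet, inspected in the following cells
words = tweets_train_tokens[-1]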

In [12]:
words[0:10]

['@',
 'americanair',
 'we',
 "'re",
 'on',
 'aa1401',
 'landed',
 'at',
 '8:55pm',
 'in']

### Step 3: Removing Non-Alphanumeric Characters (e.g. the @ in @GISMA)

In [13]:
import re

In [14]:
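# Strip every non-alphanumeric character from each token; tokens made up
# only of such characters (e.g. '@') become empty strings
words_filt = [re.sub(r'[^a-zA-Z0-9]+', '', word) for word in words]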

In [15]:
words_filt[0:10]

['', 'americanair', 'we', 're', 'on', 'aa1401', 'landed', 'at', '855pm', 'in']

In [16]:
words_filt = list(filter(None, words_filt))

In [22]:
words_filt[0:10]

['americanair',
 'we',
 're',
 'on',
 'aa1401',
 'landed',
 'at',
 '855pm',
 'in',
 'miami']
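Note that the filtering above drops the `@` symbol but keeps the handle name itself (`americanair`). If the goal is to remove the whole handle, one option is to strip it from the raw text before tokenizing. The sketch below assumes a handle is `@` followed by letters, digits, or underscores; the helper name `strip_handles` is hypothetical.

```python
import re

# An @-handle: '@' followed by letters, digits or underscores (assumed definition).
HANDLE_PATTERN = re.compile(r"@\w+")

def strip_handles(text: str) -> str:
    """Remove @-handles and collapse the whitespace they leave behind."""
    without_handles = HANDLE_PATTERN.sub(" ", text)
    return " ".join(without_handles.split())

example = "@AmericanAir we're on AA1401 landed at 8:55pm"
print(strip_handles(example))
```

This simple pattern would also cut into e-mail addresses, so a stricter rule may be needed if such text occurs in the data.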

### Step 4: Removing stopwords

In [18]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
from nltk.corpus import stopwords

In [23]:
stop_words = set(stopwords.words('english'))
# words_filt holds the tokens of a single tweet, so this filters that one tweet only
tweets_train_ready = [word for word in words_filt if word not in stop_words]

In [24]:
tweets_train_ready[0:1]

['americanair']

In [25]:
type(tweets_train_ready)

list
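The cells above walk through the steps on one tweet. To prepare the whole training set, the same steps can be wrapped in a function and applied per tweet. This is a rough sketch under stated assumptions: it uses a regular expression instead of `nltk.word_tokenize` (so `8:55pm` splits into `8` and `55pm`), and `DEMO_STOP_WORDS` is a small hypothetical stand-in for `set(stopwords.words('english'))`.

```python
import re

# Small illustrative stopword set (hypothetical); in the notebook this would be
# set(stopwords.words('english')) from NLTK.
DEMO_STOP_WORDS = {"we", "re", "on", "at", "in", "the", "a", "is", "to"}

def preprocess(tweet, stop_words):
    """Lowercase, drop @-handles, keep alphanumeric tokens, remove stopwords."""
    text = tweet.lower()
    text = re.sub(r"@\w+", " ", text)        # remove whole handles
    tokens = re.findall(r"[a-z0-9]+", text)  # runs of letters/digits become tokens
    return [token for token in tokens if token not in stop_words]

demo_tweets = [
    "@AmericanAir we're on AA1401 landed at 8:55pm in Miami",
    "@united The flight is late",
]

# One cleaned token list per tweet
demo_ready = [preprocess(tweet, DEMO_STOP_WORDS) for tweet in demo_tweets]
print(demo_ready)
```

Keeping one token list per tweet preserves the alignment between each tweet and its sentiment label, which the classifier needs later.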

## Traditional Model
- Text Representation Model: TF-IDF Vectorizer
- Traditional Classification Model: Naive Bayes
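A minimal sketch of how these two components could be combined, tuned, and evaluated with Scikit-Learn. The toy texts, labels, and hyperparameter grid below are invented for illustration; in the notebook the inputs would be the tweet texts and sentiment labels from `df_train` and `df_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
train_texts = [
    "great flight and friendly crew", "loved the service", "thanks for the upgrade",
    "flight was late again", "rude service at the gate", "lost my bag, terrible",
    "what time does boarding start", "is there wifi on this flight", "flying to miami tomorrow",
]
train_labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3
test_texts = ["friendly crew, thanks", "late and rude", "does this flight have wifi"]
test_labels = ["positive", "negative", "neutral"]

# Text representation followed by the classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Validate two design decisions with cross-validation on the training data
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(train_texts, train_labels)

# Evaluate the refitted best pipeline on held-out texts
predictions = search.predict(test_texts)
print(search.best_params_)
print(classification_report(test_labels, predictions, zero_division=0))
```

`TfidfVectorizer` lowercases and tokenizes on its own by default, so raw tweet strings can be passed in directly; custom preprocessing can be supplied through its `preprocessor` or `tokenizer` parameters.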

## Deep Learning Models
- Deep Learning Classification Model: DistilBERT, via Hugging Face Transformers
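As a starting point, the sketch below shows how a DistilBERT classifier with three output classes might be loaded and queried. The checkpoint name `distilbert-base-uncased`, the label order, and the helper names `load_distilbert` and `predict_sentiment` are assumptions. It also assumes `transformers` and PyTorch are installed, and it does not include fine-tuning.

```python
# Assumed label order; it must match how the training labels are encoded.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

CHECKPOINT = "distilbert-base-uncased"  # assumed pretrained checkpoint


def load_distilbert():
    """Download the tokenizer and a DistilBERT model with a new 3-class head."""
    # Imported here so the label mappings can be used without transformers installed
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


def predict_sentiment(texts, tokenizer, model):
    """Return one predicted label per input text."""
    import torch

    batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        logits = model(**batch).logits
    return [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]
```

Calling `tokenizer, model = load_distilbert()` followed by `predict_sentiment(["flight was late"], tokenizer, model)` would return a label, but the classification head starts out randomly initialized, so predictions are not meaningful until the model has been fine-tuned on the training tweets.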