# Feature Engineering for text

## Text preprocessing

Textual data often contains noise and redundancy that can degrade the performance of a machine learning model trained on it. Fortunately, a few simple preprocessing steps can filter out much of the unwanted text and leave us with the most informative features.

In [None]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt

In [None]:
s = "Hello there, this Is an Example sentence containing 9 words!"

In [None]:
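def remove_punctuation(text):
    # The hyphen is placed last in the character class so that it is matched literally;
    # written as ";-=" it would define a range (";", "<", "=") and leave hyphens in place.
    return re.sub("[.,!?:;='\"@#_-]", "", text)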

s_clean = remove_punctuation(s)

print(s)
print(s_clean)
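One detail worth keeping in mind when writing character classes like the one in `remove_punctuation`: inside `[...]`, a hyphen placed between two characters defines a range rather than a literal hyphen. The following minimal sketch contrasts the two behaviours; the sample string `text` is made up for illustration.

```python
import re

# Made-up sample string for illustration
text = "a<b-c;d=e"

# ";-=" is read as the range from ";" to "=", which covers ";", "<" and "="
as_range = re.sub("[;-=]", "", text)

# With the hyphen at the end, the class matches ";", "=" and a literal "-"
as_literal = re.sub("[;=-]", "", text)

print(as_range)    # ab-cde
print(as_literal)  # a<bcde
```

Placing the hyphen first or last in the class, or escaping it as `\-`, are the usual ways to match it literally.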

In [None]:
def remove_numbers(text):
    return re.sub("\\d+", "", text)

s_clean = remove_numbers(s_clean)

print(s)
print(s_clean)

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
print(stop_words)

In [None]:
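def remove_stopwords(text,stopwords):
    # Case-sensitive: capitalized stop words such as "Is" are kept
    s_stop = [w for w in text.split() if w not in stopwords]
    return " ".join(s_stop)

def remove_stopwords_lower(text,stopwords):
    # Lowercase the text first so that capitalized stop words are removed as well
    s_stop = [w for w in text.lower().split() if w not in stopwords]
    return " ".join(s_stop)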

s_clean = remove_stopwords(s_clean,stop_words)
print(s)
print(s_clean)

s_clean = remove_stopwords_lower(s_clean,stop_words)
print(s_clean)

### Stemming / Text normalization

Textual data contains many variations of the same word, so simply counting occurrences spreads the count for one word across several different tokens. To map the counts of all these slight variations to the same base word, different techniques can be applied. ___Stemming___ reduces each word to its word stem by stripping affixes; ___lemmatization___, shown further below, maps each word to its dictionary form instead.
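To make the counting problem concrete, here is a minimal sketch with a deliberately naive suffix stripper. `toy_stem` and its three-suffix rule list are hypothetical stand-ins chosen for illustration; they are not the Porter algorithm provided by NLTK and will mishandle many real words.

```python
from collections import Counter

def toy_stem(word):
    """Strip one of a few common English suffixes (toy rules, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = "walk walks walked walking talk talks".split()

raw_counts = Counter(tokens)                        # every variation is its own entry
stem_counts = Counter(toy_stem(t) for t in tokens)  # variations share one entry

print(raw_counts)
print(stem_counts)  # Counter({'walk': 4, 'talk': 2})
```

Without normalization the six tokens yield six separate counts of 1; after stemming they collapse into two base words.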

In [None]:
import nltk

stemmer = nltk.stem.porter.PorterStemmer()
stemmer.stem("currencies")

In [None]:
# Implement a function that uses stemming to bring down each word in a given text to its base form and then return the edited text.
# Use this function to further clean up the example text.

def stem_words_in_text(text,stemmer):
    stemmed_text = [stemmer.stem(w) for w in text.split()]
    return " ".join(stemmed_text)

s_clean_stemmer = stem_words_in_text(s_clean,stemmer)

print(s)
print(s_clean_stemmer)

In [None]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("currencies")

In [None]:
# Implement a function that uses lemmatization to bring down each word in a given text to its base form and then return the edited text.
# Use this function to further clean up the example text.

def lemmatize_words_in_text(text,lemmatizer):
    lemmatized_text = [lemmatizer.lemmatize(w) for w in text.split()]
    return " ".join(lemmatized_text)

lemmatizer = WordNetLemmatizer()

s_clean_lemmatizer = lemmatize_words_in_text(s_clean,lemmatizer)
print(s_clean_lemmatizer)

## Bag-of-Words

In order to classify or analyze textual data, we first need to transform it into a numerical representation. One of the simplest approaches to do this is called ___Bag-of-Words___. All words and their respective counts in the dataset are used to create the feature vector for our machine learning model. 
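As a rough illustration of the idea, the sketch below builds a Bag-of-Words by hand for two made-up sentences. The variable names (`docs`, `vocab`, `vectors`) are chosen here for illustration, and tokenization is a plain whitespace split, which is simpler than what a library vectorizer does.

```python
from collections import Counter

# Two made-up example documents
docs = ["the cat sat", "the cat saw the dog"]

# Tokenize by whitespace (a simplification)
tokenized = [d.split() for d in docs]

# The vocabulary is the sorted set of all words seen in the dataset
vocab = sorted({w for doc in tokenized for w in doc})

# Each document becomes a vector of word counts, one entry per vocabulary word
vectors = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Note that word order is lost: each document is described only by how often each vocabulary word occurs in it.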

### Exercise

Your task in this exercise is to use the "deepnlp" dataset to classify individual texts. Take a look at the provided data and transform the dataframe in a suitable way; the aim is to correctly predict the values in the "class" column.
Preprocess the texts first, then create the Bag-of-Words representation that will be used to train and test the classification model (logistic regression). Evaluate the score of the model on the test dataset and create a confusion matrix from the results. Furthermore, use this matrix to calculate metrics like precision, recall and accuracy, and plot a precision-recall curve.

In [None]:
df = pd.read_csv("deepnlp/Sheet_1.csv")
df.info()  # info() prints its summary itself and returns None
df.head()

In [None]:
df = df.dropna(subset=["class","response_text"])

df["class"] = df["class"].apply(lambda x: 1 if x=="not_flagged" else -1)
df["class"].plot(kind="hist")
plt.title("Values of \"class\" column")
plt.show()

df["response_text"] = df["response_text"].apply(remove_punctuation)
df["response_text"] = df["response_text"].apply(remove_numbers)

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X1 = df[["response_text"]]
y = df[["class"]]

X1_train, X1_test, y_train, y_test = train_test_split(X1,y,test_size=0.25, random_state=2564)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern="(?u)\\b\\w+\\b")

train_bows1 = vectorizer.fit_transform(X1_train["response_text"])
test_bows1 = vectorizer.transform(X1_test["response_text"])
t1 = vectorizer.get_feature_names_out()

print("len(t1): {}".format(len(t1)))

In [None]:
# https://stackoverflow.com/questions/34165731/a-column-vector-y-was-passed-when-a-1d-array-was-expected
def evaluate_score(X_train, X_test, y_train, y_test):
    model = linear_model.LogisticRegression().fit(X_train, y_train.values.ravel())
    score = model.score(X_test, y_test.values.ravel())
    return (model, score)

(model1, r1) = evaluate_score(train_bows1,test_bows1,y_train,y_test)

print("Score without stopwords removed: %0.3f" % r1)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_recall_curve

preds_model1 = model1.predict(test_bows1)

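# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predicted classes
conf_matrix1 = confusion_matrix(y_test.values.ravel(), preds_model1)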

disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix1)
disp.plot()
plt.title("Confusion matrix for model1")
plt.show()

In [None]:
# TN = conf_matrix1[0][0]
# FN = conf_matrix1[1][0]
# TP = conf_matrix1[1][1]
# FP = conf_matrix1[0][1]
def calc_precision(conf_matrix):
    return conf_matrix[1][1]/(conf_matrix[1][1]+conf_matrix[0][1])
    
def calc_recall(conf_matrix):
    return conf_matrix[1][1]/(conf_matrix[1][1]+conf_matrix[1][0])

def calc_accuracy(conf_matrix):
    return (conf_matrix[1][1]+conf_matrix[0][0])/(conf_matrix[0][0]+conf_matrix[0][1]+conf_matrix[1][0]+conf_matrix[1][1])

def calc_metrics(conf_matrix):
    return calc_precision(conf_matrix), calc_recall(conf_matrix), calc_accuracy(conf_matrix)

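prec1,rec1,acc1 = calc_metrics(conf_matrix1)
print("Precision: {:0.3f}, Recall: {:0.3f}, Accuracy: {:0.3f}".format(prec1,rec1,acc1))

# precision_recall_curve expects (y_true, scores); continuous scores yield a curve over all thresholds
scores_model1 = model1.decision_function(test_bows1)
precision, recall, thresholds = precision_recall_curve(y_test.values.ravel(), scores_model1)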
plt.plot(recall,precision,label="model1")

plt.title("Precision-Recall curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
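The index arithmetic in the metric functions is easy to get wrong, so here is a small worked example on hypothetical counts (the numbers in `cm` are invented for illustration and are not results from the dataset). It assumes the layout that `confusion_matrix(y_true, y_pred)` returns: rows are true classes, columns are predicted classes, with the negative class first.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = [[50, 10],   # true negative class: 50 TN, 10 FP
      [5, 35]]    # true positive class:  5 FN, 35 TP

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)                   # 35 / 45
recall = tp / (tp + fn)                      # 35 / 40
accuracy = (tp + tn) / (tn + fp + fn + tp)   # 85 / 100

print("precision={:.3f} recall={:.3f} accuracy={:.3f}".format(precision, recall, accuracy))
# precision=0.778 recall=0.875 accuracy=0.850
```

If the arguments to `confusion_matrix` are swapped, the matrix is transposed and the roles of FP and FN (and therefore precision and recall) are exchanged.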

#### Extension

Now extend your work by applying the different transformations shown above, as well as different n-grams, to the dataset. Your task is to cover all the combinations listed below and find the best option for this classification task. Try to use lists, functions and loops to generate the results automatically, as there are 18 different models to consider:
<br>

df:
 1. untouched 
     1. 1-gram
     1. 2-gram
     1. 3-gram        
 1. stem
     1. 1-gram
     1. 2-gram
     1. 3-gram
 1. lemma 
     1. 1-gram
     1. 2-gram
     1. 3-gram

df_stopwords:
 1. untouched 
     1. 1-gram
     1. 2-gram
     1. 3-gram        
 1. stem
     1. 1-gram
     1. 2-gram
     1. 3-gram
 1. lemma 
     1. 1-gram
     1. 2-gram
     1. 3-gram
    
<br>
Tips: 

- df.copy()
- x1_train, x1_test, x2_train, x2_test ... = train_test_split(X1,X2,...,Xx,y,test_size=0.25, random_state=2564) (you can split an arbitrary number of inputs in a single call; all of them are split along the same rows)
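For reference, an n-gram is a sequence of n consecutive tokens. The sketch below generates 1-, 2- and 3-grams by hand for a made-up sentence; the helper `ngrams` is a hypothetical name used only for this illustration. Setting `ngram_range=(n,n)` on `CountVectorizer` makes it count features of this kind.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up example sentence
tokens = "i feel much better today".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['i', 'feel', 'much', 'better', 'today']
print(bigrams)   # ['i feel', 'feel much', 'much better', 'better today']
print(trigrams)  # ['i feel much', 'feel much better', 'much better today']
```

Higher-order n-grams preserve some word order, at the cost of a larger and sparser vocabulary.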

In [None]:
df = pd.read_csv("deepnlp/Sheet_1.csv")
df.info()  # info() prints its summary itself and returns None
df.head()

df = df.dropna(subset=["class","response_text"])

df["class"] = df["class"].apply(lambda x: 1 if x=="not_flagged" else -1)
df["class"].plot(kind="hist")
plt.title("Values of \"class\" column")
plt.show()

df["response_text"] = df["response_text"].apply(remove_punctuation)
df["response_text"] = df["response_text"].apply(remove_numbers)

df_stem = df.copy()
df_stem["response_text"] = df_stem["response_text"].apply(stem_words_in_text,stemmer=stemmer)

df_lemma = df.copy()
df_lemma["response_text"] = df_lemma["response_text"].apply(lemmatize_words_in_text,lemmatizer=lemmatizer)

df_stopwords = df.copy()
df_stopwords["response_text"] = df_stopwords["response_text"].apply(remove_stopwords_lower,stopwords=stop_words)

df_stopwords_stem = df_stopwords.copy()
df_stopwords_stem["response_text"] = df_stopwords_stem["response_text"].apply(stem_words_in_text,stemmer=stemmer)

df_stopwords_lemma = df_stopwords.copy()
df_stopwords_lemma["response_text"] = df_stopwords_lemma["response_text"].apply(lemmatize_words_in_text,lemmatizer=lemmatizer)

In [None]:
X1 = df[["response_text"]]
X2 = df_stem[["response_text"]]
X3 = df_lemma[["response_text"]]
X4 = df_stopwords[["response_text"]]
X5 = df_stopwords_stem[["response_text"]]
X6 = df_stopwords_lemma[["response_text"]]

y = df[["class"]]


X1_train, X1_test, X2_train, X2_test, X3_train, X3_test, X4_train, X4_test, X5_train, X5_test, X6_train, X6_test, y_train, y_test = train_test_split(X1,X2,X3,X4,X5,X6,y,test_size=0.25, random_state=2564)

In [None]:
def create_bows(train_df,test_df,column_name,vectorizer):
    return [vectorizer.fit_transform(train_df[column_name]),vectorizer.transform(test_df[column_name])]

all_dfs = [[X1_train,X1_test],[X2_train,X2_test],[X3_train,X3_test],[X4_train,X4_test],[X5_train,X5_test],[X6_train,X6_test]]
all_data = []
all_preds = []
all_cms = []
all_models = []
all_prcs = []

vectorizer = CountVectorizer(token_pattern="(?u)\\b\\w+\\b")
vectorizer_bigram = CountVectorizer(ngram_range=(2,2),token_pattern="(?u)\\b\\w+\\b")
vectorizer_trigram = CountVectorizer(ngram_range=(3,3),token_pattern="(?u)\\b\\w+\\b")

for dfs in all_dfs:
    all_data.append(create_bows(dfs[0],dfs[1],"response_text",vectorizer))
    all_data.append(create_bows(dfs[0],dfs[1],"response_text",vectorizer_bigram))
    all_data.append(create_bows(dfs[0],dfs[1],"response_text",vectorizer_trigram))
    
for data in all_data:
    all_models.append(evaluate_score(data[0],data[1],y_train,y_test))

for i,model in enumerate(all_models):
    print("Score for model{}: {:0.3f}".format(i+1,model[1]))
    
    preds = model[0].predict(all_data[i][1])
    all_preds.append(preds)
    
    # confusion_matrix expects (y_true, y_pred)
    conf_matrix = confusion_matrix(y_test.values.ravel(), preds)
    all_cms.append(conf_matrix)
    
    disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
    disp.plot()
    plt.title("Confusion matrix for model{}".format(i+1))
    plt.show()
    
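    # precision_recall_curve expects (y_true, scores), so pass the continuous decision scores
    precision, recall, thresholds = precision_recall_curve(y_test.values.ravel(), model[0].decision_function(all_data[i][1]))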
    all_prcs.append([precision,recall,thresholds])
    

for i,prc in enumerate(all_prcs):
    plt.plot(prc[1],prc[0],label="model{}".format(i+1))
plt.legend()
plt.show()
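The loop above numbers the models 1 to 18 in the order in which they were created. A small sketch for turning those numbers into readable labels follows. It assumes the order of `all_dfs` and of the three vectorizers used above; the strings in `dataset_names` and `ngram_names` are labels chosen here for illustration.

```python
from itertools import product

# Same order as all_dfs (X1 ... X6) and as the unigram/bigram/trigram vectorizers
dataset_names = ["untouched", "stem", "lemma",
                 "stopwords", "stopwords+stem", "stopwords+lemma"]
ngram_names = ["1-gram", "2-gram", "3-gram"]

# product() varies the last factor fastest, matching the nested creation order
labels = ["{} / {}".format(d, n) for d, n in product(dataset_names, ngram_names)]

for i, label in enumerate(labels):
    print("model{}: {}".format(i + 1, label))
```

These labels could then be passed as `label=labels[i]` when plotting, so that the legend identifies each combination directly.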