# HW5 Skeleton Code
Please note that this skeleton code is provided to help you with homework.
Full description of each question can be found on HW5.pdf, so please read instruction of each question carefully. There might be some questions that is not presented in this code.

In [2]:
import os
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

## Q. Changing HTML Text to Plain Text

The Python library <b>BeautifulSoup</b> is useful for dealing with html text. In order to use this library, you will need to install it first by running the following command: 
 <b>conda install beautifulsoup4</b> 
 in the terminal.
 <br> In the code, you can import it by running the following line: 
<br> 
  <b>from bs4 import BeautifulSoup </b>
<br>

### Preliminary remark

My first remark is that when I split the train and test sets from the start, my vectorization will cause me problems. Indeed, the features will not be the same for the train dataset and the test one. I will not be able to use the trained model for prediction on the test set. Thus, I will concatenate both dataframes and only split them at the very end, to make sure all the featured words are contained in both training and test sets eventually.

In [3]:
  #Read our data file

df_train = pd.read_csv('stack_stats_2023_train.csv') #Todo
df_test = pd.read_csv('stack_stats_2023_test.csv') #Todo

#as explained before, I will concatenate the datasets
df = pd.concat([df_train, df_test])

In [4]:
#Cleaning 'Body'
#Change HTML Text to Plain text using get_text() function from BeautifulSoup
#If you are not familiar with the apply method, please check discussion week 10 lecture and code.

df['Body'] = df['Body'].apply(lambda  x: BeautifulSoup(x, 'html.parser').get_text() ) #Todo
#Manually cleaned up newline tag \n and tab tag \t.
df['Body'] = df['Body'].apply(lambda  x: x.replace('\n', '')) #Todo
df['Body'] = df['Body'].apply(lambda x: x.replace('\t', '')) #Todo
#If you need any other cleaning process, please uncomment the below.
#df_train['Body'] = df_train['Body'].apply(lambda ) #Todo

#Cleaning Tags
#This would be somewhat similar to the above.

#Todo: Clean Tags, please feel free to add any lines below
df['Tags'] = df['Tags'].apply(lambda  x: x.replace('<', '').replace('>', ' ')) 


#Todo: Repeat the same process for test dataset 
# => no need to do this in my situation



In [5]:
df.head()

Unnamed: 0,Id,Score,Body,Title,Tags
0,502641,1,I'm a master's student in EECS working my way ...,Why does the PyTorch tutorial on DQN define st...,machine-learning reinforcement-learning q-lear...
1,477291,1,"I do not know if this is a good question, but ...",Does random walking have a memory?,probability law-of-large-numbers
2,448489,4,I am doing 10 times repeated 10-fold cross-val...,Which statistic to report for repeated cross-v...,cross-validation
3,487075,0,"I have a dataset with 1MM records, around 40 f...",Binary classification on imbalanced data - odd...,unbalanced-classes calibration
4,481670,2,I want to run a regression where one of the ex...,How to best summarize Likert data (to use as a...,multiple-regression missing-data likert item-r...


## Q. Basic Text Cleaning and Merging into a single Text data

### Change to Lower Case, Remove puncuation, digits, 

In [6]:
#Change to Lowercase

df[['Body','Title','Tags']] = df[['Body','Title','Tags']].applymap(str.lower) #Todo, do you see why we used applymap instead of apply in this case? 



In this cas, we used applymap because we are mapping the function we want to use, here it is 'apply', accross different tags or categories of data and we want the function to be applied dependently on each of those categories. 

In [7]:
#Remove Punctations 
from string import punctuation

#You can get this function from our discussion session code. However, we leave it as a blank for a practice.
def remove_punctuation(document):

    no_punct = ''.join([character for character in document if character not in punctuation])
    
    return no_punct

df[['Body','Title','Tags']] = df[['Body','Title','Tags']].applymap(remove_punctuation)

In [8]:
#Remove Digits 

def remove_digit(document): 
    
    no_digit = ''.join([character for character in document if not character.isdigit()])
              
    return no_digit

df[['Body','Title','Tags']] = df[['Body','Title','Tags']].applymap(remove_digit)

In [9]:
df.head()

Unnamed: 0,Id,Score,Body,Title,Tags
0,502641,1,im a masters student in eecs working my way to...,why does the pytorch tutorial on dqn define st...,machinelearning reinforcementlearning qlearning
1,477291,1,i do not know if this is a good question but i...,does random walking have a memory,probability lawoflargenumbers
2,448489,4,i am doing times repeated fold crossvalidatio...,which statistic to report for repeated crossva...,crossvalidation
3,487075,0,i have a dataset with mm records around featu...,binary classification on imbalanced data odd ...,unbalancedclasses calibration
4,481670,2,i want to run a regression where one of the ex...,how to best summarize likert data to use as an...,multipleregression missingdata likert itemresp...


### Tokenization and Remove Stopwords and do stemming

In [10]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
df[['Body','Title','Tags']] = df[['Body','Title','Tags']].applymap(word_tokenize)

[nltk_data] Error loading punkt: <urlopen error [Errno 8] nodename nor
[nltk_data]     servname provided, or not known>


In [11]:
#Remove Stopwords

from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(document):
    
    words = [word for word in document if not word in stop_words]
    
    return words

df[['Body','Title','Tags']] = df[['Body','Title','Tags']].applymap(remove_stopwords)

[nltk_data] Error loading stopwords: <urlopen error [Errno 8] nodename
[nltk_data]     nor servname provided, or not known>


In [12]:
#We use porter stemming 

from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stemmer(document):
    
    stemmed_document = [porter.stem(word) for word in document]
    
    return stemmed_document

df[['Body','Title','Tags']] = df[['Body','Title','Tags']].applymap(stemmer)

## Let's Check our dataframe

In [13]:
df.head(5)

Unnamed: 0,Id,Score,Body,Title,Tags
0,502641,1,"[im, master, student, eec, work, way, toward, ...","[pytorch, tutori, dqn, defin, state, differ]","[machinelearn, reinforcementlearn, qlearn]"
1,477291,1,"[know, good, question, found, answer, anywher,...","[random, walk, memori]","[probabl, lawoflargenumb]"
2,448489,4,"[time, repeat, fold, crossvalid, want, report,...","[statist, report, repeat, crossvalid]",[crossvalid]
3,487075,0,"[dataset, mm, record, around, featur, class, i...","[binari, classif, imbalanc, data, odd, calibr,...","[unbalancedclass, calibr]"
4,481670,2,"[want, run, regress, one, explanatori, variabl...","[best, summar, likert, data, use, independ, va...","[multipleregress, missingdata, likert, itemres..."


### Q. Treat Three text data independently and merge into one column

In [14]:
#Treat Three types of data independently
#let's define functions that will help this operation

def add_body(document):
    
    added_document = [x + str('_body') for x in document]
    
    return added_document

def add_title(document):
    
    added_document = [x + str('_title') for x in document]
    
    return added_document

def add_tags(document):
    
    added_document = [x + str('_tags') for x in document]
    
    return added_document

In [15]:
df['Body'] = df['Body'].apply(add_body)
df['Title'] = df['Title'].apply(add_title)
df['Tags'] = df['Tags'].apply(add_tags)

In [16]:
#Now we need to merge all those 3 columns into a single column. Implement this below.
df['text'] = df['Body'] + df['Title'] + df['Tags']

## Let's check our DataFrame

In [17]:
df.head(5)

Unnamed: 0,Id,Score,Body,Title,Tags,text
0,502641,1,"[im_body, master_body, student_body, eec_body,...","[pytorch_title, tutori_title, dqn_title, defin...","[machinelearn_tags, reinforcementlearn_tags, q...","[im_body, master_body, student_body, eec_body,..."
1,477291,1,"[know_body, good_body, question_body, found_bo...","[random_title, walk_title, memori_title]","[probabl_tags, lawoflargenumb_tags]","[know_body, good_body, question_body, found_bo..."
2,448489,4,"[time_body, repeat_body, fold_body, crossvalid...","[statist_title, report_title, repeat_title, cr...",[crossvalid_tags],"[time_body, repeat_body, fold_body, crossvalid..."
3,487075,0,"[dataset_body, mm_body, record_body, around_bo...","[binari_title, classif_title, imbalanc_title, ...","[unbalancedclass_tags, calibr_tags]","[dataset_body, mm_body, record_body, around_bo..."
4,481670,2,"[want_body, run_body, regress_body, one_body, ...","[best_title, summar_title, likert_title, data_...","[multipleregress_tags, missingdata_tags, liker...","[want_body, run_body, regress_body, one_body, ..."


### Q. Detokenize and convert to document term matrices

In [18]:
#Merge Three text column into one column and detokenize

from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.feature_extraction.text import CountVectorizer

text = df['text'].apply(TreebankWordDetokenizer().detokenize)
countvec = CountVectorizer(min_df=0.05)
sparse_dtm = countvec.fit_transform(text)

In [19]:
#Convert the sparse dtm to pandas DataFrame.
dtm = pd.DataFrame(sparse_dtm.toarray(), columns=countvec.get_feature_names(), index=text.index)



### Q. Change dependent variable to binary variable

In [20]:
#Change 'Score' to a binary variable, which indicates whether the question is good or not.
y = np.where(df['Score'] > 0.5, 1, 0)

In [21]:
#Add y_train and y_test to your data frame if it is needed. Drop unnecessary columns
df['y'] = y
df.drop(columns = ['Score'], inplace = True)

## Let's check our DataFrame


In [22]:
df.head()

Unnamed: 0,Id,Body,Title,Tags,text,y
0,502641,"[im_body, master_body, student_body, eec_body,...","[pytorch_title, tutori_title, dqn_title, defin...","[machinelearn_tags, reinforcementlearn_tags, q...","[im_body, master_body, student_body, eec_body,...",1
1,477291,"[know_body, good_body, question_body, found_bo...","[random_title, walk_title, memori_title]","[probabl_tags, lawoflargenumb_tags]","[know_body, good_body, question_body, found_bo...",1
2,448489,"[time_body, repeat_body, fold_body, crossvalid...","[statist_title, report_title, repeat_title, cr...",[crossvalid_tags],"[time_body, repeat_body, fold_body, crossvalid...",1
3,487075,"[dataset_body, mm_body, record_body, around_bo...","[binari_title, classif_title, imbalanc_title, ...","[unbalancedclass_tags, calibr_tags]","[dataset_body, mm_body, record_body, around_bo...",0
4,481670,"[want_body, run_body, regress_body, one_body, ...","[best_title, summar_title, likert_title, data_...","[multipleregress_tags, missingdata_tags, liker...","[want_body, run_body, regress_body, one_body, ...",1


## (b) Please read the instruction carefully in the pdf.

In [23]:
#Let's define the X_train and X_test
from sklearn.model_selection import train_test_split

X = dtm
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=2)

Let's try different models to classify the useful vs not useful questions of the dataset. First we will try a baseline model that always predicts that the question is not useful since it's the most likely (predominant) class.

### Baseline

In [24]:
baseline_acc = 1-(sum(y_test)/len(y_test))
baseline_TPR = 0
baseline_FPR = 0
baseline_PRE = 'None'

Now let's try a Logistic Regression (first classification model we saw in class). 

### Logistic Regression 

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

logreg = LogisticRegression(random_state=2)
logreg.fit(X_train, y_train)
y_prob = logreg.predict_proba(X_test)
y_pred = pd.Series([1 if x > 0.5 else 0 for x in y_prob[:,1]])

cm = confusion_matrix(y_test, y_pred)
log_TPR = cm[1,1]/(cm[1,1]+cm[1,0])
log_FPR = cm[0,1]/(cm[0,0]+cm[0,1])
log_PRE = cm[1,1]/(cm[1,1]+cm[0,1])
log_acc = accuracy_score(y_test, y_pred)

### Decision Tree

In [26]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(min_samples_leaf=5, 
                             ccp_alpha=0.001,
                             random_state = 2)


dtc = dtc.fit(X_train, y_train)

In [27]:
#predict and evaluate 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
y_pred_dtc = dtc.predict(X_test)
cm = confusion_matrix(y_test, y_pred_dtc)
dtc_TPR = cm[1,1]/(cm[1,1]+cm[1,0])
dtc_FPR = cm[0,1]/(cm[0,0]+cm[0,1])
dtc_PRE = cm[1,1]/(cm[1,1]+cm[0,1])
dtc_acc = accuracy_score(y_test, y_pred_dtc)

### Random Forest 

In [28]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_features=6, min_samples_leaf=10, n_estimators=500, random_state=2)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred_rf)
rf_TPR = cm[1,1]/(cm[1,1]+cm[1,0])
rf_FPR = cm[0,1]/(cm[0,0]+cm[0,1])
rf_PRE = cm[1,1]/(cm[1,1]+cm[0,1])
rf_acc = accuracy_score(y_test, y_pred_rf)

### LDA

In [29]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
y_pred_lda = lda.predict(X_test)
cm = confusion_matrix(y_test, y_pred_lda)
lda_TPR = cm[1,1]/(cm[1,1]+cm[1,0])
lda_FPR = cm[0,1]/(cm[0,0]+cm[0,1])
lda_PRE = cm[1,1]/(cm[1,1]+cm[0,1])
lda_acc = accuracy_score(y_test, y_pred_lda)

In [30]:
#Create Comparison Table
#These lines are provided for you to help construct a comparison table.
#It is not requred to follow this format. + You need to find ACC, TPR, FPR, PRE for each model that you choose.
comparison_data = {'Baseline':[baseline_acc,baseline_TPR,baseline_FPR, baseline_PRE],
                   'Logistic Regression':[log_acc,log_TPR,log_FPR, log_PRE],
                   'Decision Tree Classifier':[dtc_acc,dtc_TPR,dtc_FPR,dtc_PRE],
                   'Random Forest with CV':[rf_acc,rf_TPR, rf_FPR,rf_PRE],
                  'Linear Discriminant Analysis':[lda_acc,lda_TPR, lda_FPR,lda_PRE]}

comparison_table = pd.DataFrame(data=comparison_data, index=['Accuracy', 'TPR', 'FPR','PRE']).transpose()
comparison_table.style.set_properties(**{'font-size': '12pt',}).set_table_styles([{'selector': 'th', 'props': [('font-size', '10pt')]}])
comparison_table

Unnamed: 0,Accuracy,TPR,FPR,PRE
Baseline,0.50588,0.0,0.0,
Logistic Regression,0.569766,0.479637,0.3422,0.577889
Decision Tree Classifier,0.538005,0.237733,0.168704,0.579199
Random Forest with CV,0.57013,0.5,0.361371,0.574732
Linear Discriminant Analysis,0.571221,0.473749,0.333573,0.581101



### Report details of your training procedures and final comparisons on the test set in this cell. Use your best judgment to choose a final model and explain your choice.

The model that has the highest metrics is LDA, closely followed by the RF. 
Since LDA is better for multiclass classification under strong assumptions (independence and same sd accross features), since I cannot guarantee that those assumptions are met, I would chose the RF which is more interpretable. It also has a higher TPR and FPR. 

In [32]:
# Report Bootstrap Analysis 

from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

def bootstrap_metrics(X, y, model, metric, n_bootstrap=50, alpha=0.05):
      n_samples = len(X)
      results = np.zeros((n_bootstrap, 1))
      for i in range(n_bootstrap):
            indices = np.random.choice(n_samples, size=n_samples, replace=True)
            X_boot = X.iloc[indices]
            y_boot = y[np.array(indices)]
            y_pred = model.predict(X_boot)
            results[i, :] = metric(y_boot, y_pred)
      ci_lower = np.percentile(results, 100*(alpha/2), axis=0)
      ci_upper = np.percentile(results, 100*(1-alpha/2), axis=0)
      average = np.mean(results)
      return(ci_lower, ci_upper, average)

def fpr(y_boot, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    fpr =fp/(fp+tn)
    return(fpr)

ci_tpr = bootstrap_metrics(X_test, y_test, rf, recall_score)
ci_fpr = bootstrap_metrics(X_test, y_test, rf, fpr)
ci_pre = bootstrap_metrics(X_test, y_test, rf, precision_score)
ci_acc = bootstrap_metrics(X_test, y_test, rf, accuracy_score)

# Print the confidence intervals
print("TPR mean: ", round(float(ci_tpr[2]),3), "\nConfidence Interval: ",
      list([round(float(ci_tpr[0]),3), round(float(ci_tpr[1]),3)]), "\n")
print("FPR mean: ", round(float(ci_fpr[2]),3), "\nConfidence Interval: ",
      list([round(float(ci_fpr[0]),3), round(float(ci_fpr[1]),3)]), "\n")
print("Precision mean: ", round(float(ci_pre[2]),3), "\nConfidence Interval: ",
      list([round(float(ci_pre[0]),3), round(float(ci_pre[1]),3)]), "\n")
print("Accuracy mean: ", round(float(ci_acc[2]),3), "\nConfidence Interval: ",
      list([round(float(ci_acc[0]),3), round(float(ci_acc[1]),3)]), "")

TPR mean:  0.501 
Confidence Interval: [ [0.484, 0.513] ]

FPR mean:  0.43 
Confidence Interval: [ [0.419, 0.439] ]

Precision mean:  0.576 
Confidence Interval: [ [0.56, 0.591] ]

Accuracy mean:  0.57 
Confidence Interval: [ [0.557, 0.579] ]


### (c)
We would like to maximize the probability that the top question is useful. This means that if our model has a low true positive rate, it might be very detrimental, way more than if it has a high false positive rate. We want our model to be precise, this is why the metric I would use is precision. 
All my models have a very similar precision, with LDA being slightly better. 
I notice that it's a different model than I chose in the previous question (for interpretabiity reasons).
On average, the probability of the baseline to display a useful question is simply the rate of useful questions in the dataset: around 49%. 
We thus have an improvement of around 9% on our model using LDA. 
It is a good start but I am sure I can improve the task more through hyperparameter tuning. 