# Contents

1. [Intro to this section](#intro)
2. [Analysis of Logistic Coefficients](#analysis)
3. [Conclusions](#conclusion)


---

# Logistic Model + What is Condescension? <a id='intro'></a>

As mentioned in the problem statement, one of the goals is to identify common condescending patterns. There are a lot of very powerful models available for NLP, but most of them are neural nets, which are hard to interpret. Since there isn't much research on what makes text condescending, I decided to start with a very simple model.

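Count vectorizing the words, along with logistic regression, creates a model with easily explainable parameters: each coefficient belongs to one word or phrase. A large positive coefficient means the term pushes the prediction towards "condescending", and a large negative coefficient means it pushes the prediction away from it.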
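To make the interpretation a bit more concrete: a logistic regression coefficient is a change in log-odds, so with count features one extra occurrence of a term multiplies the odds of the positive class by `exp(coefficient)`. Below is a minimal sketch using only the standard library. The term names, coefficient values, and intercept are made up for illustration and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for three count features (illustrative only)
coefficients = {"term_a": 1.5, "term_b": -0.7, "term_c": 0.0}
intercept = 0.0  # assumed log-odds for a document containing none of the terms

for term, coef in coefficients.items():
    odds_multiplier = math.exp(coef)            # effect of one extra occurrence on the odds
    prob_with_term = sigmoid(intercept + coef)  # probability if the term appears once
    print(f"{term}: coef={coef:+.1f}, odds x{odds_multiplier:.2f}, p={prob_with_term:.2f}")
```

A coefficient of zero leaves the odds unchanged, which is why sorting by absolute value is a reasonable way to find the terms the model relies on most.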

Hopefully this gives us some insight into what a model should use to predict condescension.

In [48]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score

In [2]:
# read in data (that we cleaned in the prev notebook)
cond_df = pd.read_csv("./cond_data/added_features/balanced_train_more_features.csv")
cond_df.drop("Unnamed: 0", axis = 1, inplace=True)
cond_df.head(3)

Unnamed: 0,quotedpost,quotedreply,label,post,reply,post_user,reply_user,start_offset,end_offset,reddit_post_id,reddit_reply_id,has_cond,post_len,reply_len,cleaned_post,cleaned_reply
0,Please educate yoyrself before you bring your ...,"Not condescending at all, jeez.",True,"Well a guy is saying Barra, who has those grea...",> Please educate yoyrself before you bring you...,StalinHimself,Kel_Casus,135,208,dbl4vl9,dblfraj,1,37,17,"Well a guy is saying Barra, who has those grea...","Not condescending at all, jeez."
1,There might be some small piece that's incorrect,You said that. Not me. Not James-Cizuz. You sa...,True,> I think you're the one who has a reading com...,> theories are constantly growing and evolving...,kishi,jids,365,413,c2dtpq9,c2dtywp,1,314,230,"Well you're a stupid poopy-head.\n\nSee, I don...",Why would theories self-correct if they were a...
2,If I try and force down a breakfast I start ga...,Yes!\n\nPeople were so condescending about it ...,False,For me it's like temporarily having the flu. T...,> If I try and force down a breakfast I start ...,amphetaminesfailure,CowGiraffe,331,383,cuv97mf,cuvnb27,1,107,179,For me it's like temporarily having the flu. T...,Yes!\n\nPeople were so condescending about it ...


I want to look at both the post and the reply, so I need to combine the features from both.

In [3]:

##################################
# Some code from the prev notebook
##################################

# It's just wrappers for the word stemmer and todense() functions
# so I can put them into a pipeline

from nltk.stem import PorterStemmer, WordNetLemmatizer

class BaseClass:
    def __init__(self):
        pass
    # subclasses must implement fit and transform
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # default: pass the data through unchanged (subclasses override this)
        return X

class StemOrLemmatizer(BaseClass):
    
    def __init__(self, cols, choice = 'stem'):
        
        self._return_df = True
        if isinstance(cols, str):
            cols = [cols]
        if len(cols) == 1:
            self._return_df = False
        
        # It will only lemmatize the columns that you specify here
        self.cols_to_encode = cols
        
        if choice not in ["stem", "lemma"]:
            raise Exception("choice parameter can only be 'stem' or 'lemma'")
        self.choice = choice
        
    def transform(self, X, y=None):
        
        # create list of lists
        # the outer list is basically each column
        # the inner list is the entries in each column
        list_of_lists = []
        
        if self.choice == "stem":
            lemma = PorterStemmer()
        else:
            lemma = WordNetLemmatizer()
        
        #loop through all columns
        for i, col_name in enumerate(self.cols_to_encode):
            # add a new list (i.e. a new column)
            list_of_lists.append([])
            
            # loop through each entry in this column and append to list
            # not the most computationally efficient but w/e
            for sentence in X[col_name]:
                # loop through each word and lemmatize/stem it
                split_words = sentence.split()
                
                if self.choice == "stem":
                    split_words = [lemma.stem(s) for s in split_words]
                else:
                    split_words = [lemma.lemmatize(s) for s in split_words]
                    
                new_text = " ".join(split_words)
                
                # save this to the list
                list_of_lists[i].append(new_text)
                
        # if you only have one column, it should return a series (not DF)
        # this is to allow it to pass directly into TFIDF without causing errors
        if self._return_df == False:
            return pd.Series(list_of_lists[0])
        
        # well turns out my list was the wrong way, so just transpose it
        return pd.DataFrame(list_of_lists).transpose()

# and a class that dense-s it after vectorizing
class ToDense(BaseClass):
    def transform(self, data):
        return data.todense()

In [4]:
# import more model things
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# its useful later
import copy

I am using an n-gram range of (1, 2), i.e. single words and two-word phrases, because I want to know if there are any condescending phrases, not just specific words. Including 3-grams would take too long, so I stopped at 2.
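As a rough illustration of what an n-gram range of (1, 2) produces, here is a small pure-Python sketch. The helper `word_ngrams` and the token list are hypothetical; the sketch only mimics the n-gram step and skips the lowercasing, tokenizing, and stop-word removal that `CountVectorizer` also performs.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous word n-grams with n between min_n and max_n."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["sound", "condescending", "jeez"]  # hypothetical, already-cleaned tokens
features = word_ngrams(tokens, ngram_range=(1, 2))
print(features)
# ['sound', 'condescending', 'jeez', 'sound condescending', 'condescending jeez']
```

Each extra n adds almost as many candidate features as there are tokens per document, which is part of why going up to 3-grams gets expensive.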

In [14]:
# Similar pipeline to the previous notebook (uses different vectorizer and classifier)
# For both the post and reply, stem > vectorize > convert to dense matrix
# then, combine both and feed into logistic regression

#############################################################

# for posts
post_pipeline_steps = [("stem", StemOrLemmatizer(cols = ["cleaned_post"])), # we'll still stem it
               ("cvec", CountVectorizer(stop_words = ENGLISH_STOP_WORDS, min_df = 10, ngram_range=(1,2))), # use cvect for easy interpretation
               ("dense", ToDense())] # still need this

# for replies, it's basically the same so make a copy
# (we want different objs so make deep copy)
reply_pipeline_steps = copy.deepcopy(post_pipeline_steps)

# it uses a different col so just change that
reply_pipeline_steps[0] = ("stem", StemOrLemmatizer(cols = ["cleaned_reply"]))


# make pipelines
post_pipeline = Pipeline(post_pipeline_steps)
reply_pipeline = Pipeline(reply_pipeline_steps)

# combine the 2 individual pipelines
combined = FeatureUnion([("post", post_pipeline),
                         ("reply", reply_pipeline)])

# add in the logistic regression to the end
logreg_pipe = Pipeline([("combine", combined),
                        ("logreg", LogisticRegression(max_iter = 1000))])

In [50]:
# get train and test
X_train, X_test, y_train, y_test = train_test_split(cond_df.drop("label", axis = 1),
                                                    cond_df["label"],
                                                    random_state=420)

In [16]:
logreg_pipe.fit(X_train, y_train);

In [17]:
test_predictions = logreg_pipe.predict(X_test)

In [18]:
# see how well it did on the test data
from sklearn.metrics import accuracy_score
accuracy_score(y_test, test_predictions)

0.7311827956989247

In [19]:
# how well it did on the training data
accuracy_score(y_train, logreg_pipe.predict(X_train))

0.999231950844854

This is quite promising, since the test accuracy (0.73) is high (better than the previous model). However, the training accuracy is almost 1.0, so the model is overfitting and I have to tune it by regularizing the logistic regression. That shouldn't affect the explainability of the model, since it only changes the values of the coefficients.

It's pretty fast to do a grid search so let's just do one right now.

In [21]:
search_params = {"logreg__C" : [1/i for i in np.linspace(1, 10, 4)],
                 "logreg__l1_ratio" : np.linspace(0, 1, 3),
                 "logreg__penalty" : ["elasticnet"],
                 "logreg__solver" : ["saga"]
                 }

gridsearch = GridSearchCV(logreg_pipe, search_params, n_jobs=-1, cv = 4, verbose =1 )

In [22]:
gridsearch.fit(X_train, y_train)
print("done")

Fitting 4 folds for each of 12 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed: 26.3min finished


done


In [23]:
gridsearch.best_estimator_

Pipeline(steps=[('combine',
                 FeatureUnion(transformer_list=[('post',
                                                 Pipeline(steps=[('stem',
                                                                  <__main__.StemOrLemmatizer object at 0x00000216482B2130>),
                                                                 ('cvec',
                                                                  CountVectorizer(min_df=10,
                                                                                  ngram_range=(1,
                                                                                               2),
                                                                                  stop_words=frozenset({'a',
                                                                                                        'about',
                                                                                                        'above',
                 

The best logistic regression found by the grid search is elastic net with C = 0.1, the strongest regularization in the grid.
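For reference, the grid expressions used in the search expand to the values below. This is a small sketch with a hand-written `linspace` helper (a stand-in for `np.linspace`, written here so the snippet has no dependencies). In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization; `l1_ratio = 0` corresponds to a pure L2 penalty and `l1_ratio = 1` to a pure L1 penalty.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (stand-in for np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

c_values = [1 / i for i in linspace(1, 10, 4)]  # same expression as in the search grid
l1_ratios = linspace(0, 1, 3)

print([round(c, 3) for c in c_values])  # [1.0, 0.25, 0.143, 0.1]
print(l1_ratios)                        # [0.0, 0.5, 1.0]
print(len(c_values) * len(l1_ratios))   # 12
```

The 12 combinations match the "12 candidates" reported by the grid search. Since C = 0.1 sits at the edge of the grid, extending the grid to smaller values might be worth trying.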

In [24]:
gridsearch.best_score_

0.7631791449234022

In [49]:
val_predictions = gridsearch.predict(X_test)

In [51]:
roc_auc_score(y_test, val_predictions)

0.7765880132137801

The ROC AUC (0.78, computed from the hard class predictions) and the cross-validated accuracy (0.76) are pretty good too. Now that we have a decently good model, I want to look at what the model is using to predict the classification.
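One caveat about the AUC above: it is computed from the output of `gridsearch.predict(X_test)`, i.e. from hard class labels. With only two distinct score values, ROC AUC reduces to balanced accuracy (the mean of the true positive rate and true negative rate). The sketch below demonstrates this with a hand-written pairwise AUC; the function `pairwise_auc`, the labels, and the probabilities are all made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities (illustrative only)
y_true = [1, 1, 1, 0, 0, 0]
proba = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
hard = [1 if p >= 0.5 else 0 for p in proba]  # what .predict() would return

tpr = sum(h == 1 for y, h in zip(y_true, hard) if y == 1) / y_true.count(1)
tnr = sum(h == 0 for y, h in zip(y_true, hard) if y == 0) / y_true.count(0)
balanced_accuracy = (tpr + tnr) / 2

print(pairwise_auc(y_true, hard))   # AUC from hard labels
print(balanced_accuracy)            # same value
print(pairwise_auc(y_true, proba))  # AUC from the full ranking
```

In this notebook, a score-based AUC could probably be obtained with `roc_auc_score(y_test, gridsearch.predict_proba(X_test)[:, 1])`, assuming the second probability column corresponds to the `True` label.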

In [25]:
# get the best model
best_logistic_model = gridsearch.best_estimator_

In [26]:
# from this model, get the coeffs of the logistic regression
logistic_coeffs = best_logistic_model.named_steps["logreg"].coef_

# get the words from the post
post_vocab = dict(best_logistic_model.named_steps["combine"].transformer_list) \
                .get("post").named_steps["cvec"].get_feature_names()

# words from reply
reply_vocab = dict(best_logistic_model.named_steps["combine"].transformer_list) \
                .get("reply").named_steps["cvec"].get_feature_names()

In [27]:
# need a combined list so I can put it into a big dataframe
all_vocab = copy.deepcopy(post_vocab)
all_vocab.extend(reply_vocab)

In [28]:
# need a column that says whether the word is initially from post or reply
# since words are likely to be repeated between the 2
original_source = [1] * len(post_vocab)
original_source.extend([0] * len(reply_vocab))

In [29]:
# make into dataframe so its easier to use
logistic_results = pd.DataFrame([all_vocab, logistic_coeffs[0], original_source]).transpose()

In [30]:
# rename cols
logistic_results.rename({0 : "Word",
                         1 : "Logistic coefficient",
                         2 : "Word from post"}, axis = 1, inplace= True)

In [31]:
# we want to sort by magnitude so just put a column that is the absolute value
logistic_results["abs_coeff"] = np.abs(logistic_results["Logistic coefficient"])

In [35]:
# show the top most effective words for classification
logistic_results.sort_values("abs_coeff", ascending=False).head(50)

|      | Word | Logistic coefficient | Word from post | abs_coeff |
|------|------|----------------------|----------------|-----------|
| 4209 | condescend | 2.92917 | 0 | 2.92917 |
| 4242 | condescending | 2.91712 | 0 | 2.91712 |
| 4251 | condescendingli | 2.19894 | 0 | 2.19894 |
| 4569 | don mean | -1.17573 | 0 | 1.17573 |
| 6648 | sound condescending | -0.904462 | 0 | 0.904462 |
| 617 | condescens | -0.8053 | 1 | 0.8053 |
| 5751 | need condescending | 0.716781 | 0 | 0.716781 |
| 6711 | stay | -0.542961 | 0 | 0.542961 |
| 7190 | wa condescending | -0.525116 | 0 | 0.525116 |
| 7005 | took | -0.446282 | 0 | 0.446282 |


## Analysis of coefficients <a id='analysis'></a>

As expected, the top words are just different forms of the word "condescending". The stemmer didn't really work that well, since these forms ended up as separate features ("condescend", "condescending", "condescendingli"), but as we have a lot of data it was still OK.
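One plausible reason the unstemmed "condescending" survives (this is my guess, not something verified against the data): the stemming wrapper splits on whitespace, so a chunk like "condescending," still has punctuation attached when it reaches the stemmer and is left unchanged, and the vectorizer strips the punctuation afterwards. The sketch below illustrates the idea with a toy stand-in for the stemmer (`toy_stem` is hypothetical and only strips a trailing "ing") and the default `CountVectorizer` token pattern.

```python
import re

def toy_stem(token):
    """Hypothetical stand-in for a stemmer: strips a trailing 'ing' only."""
    return token[:-3] if token.endswith("ing") else token

text = "not condescending, just condescending"  # made-up example reply

# Step 1: stem whitespace-separated chunks, as the wrapper class does
stemmed = " ".join(toy_stem(chunk) for chunk in text.split())

# Step 2: extract word tokens with CountVectorizer's default token pattern
tokens = re.findall(r"(?u)\b\w\w+\b", stemmed)
print(tokens)
# ['not', 'condescending', 'just', 'condescend']
```

If this is what is happening, stripping punctuation before stemming (or stemming inside the vectorizer's tokenizer) would probably merge these features.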

**The most interesting thing to note is**: the most useful words that the model uses to predict the outcome are ones in the reply. In a sense, by looking at how people respond, we can determine whether the initial post is condescending or not. This means that the model isn't predicting condescension in general but rather the sentiment of the reply.

This is bad if we want to make a model that identifies condescending Reddit posts, since it only works if someone replies (although once they reply it becomes quite accurate).

It also means that for more complicated models, I should try some kind of sentiment analysis on the replies.

---

These are the words from the **reply** that the model will use to predict, along with a '+' or '-' for whether the word is associated with condescension:
- Various forms of the word 'condescending' (+)
- "don't mean" (-)
- "stay" (-)
- "took" (-)
- "f\*\*k" (+)
- "ban" (-)
- "lot" (-)
    
Thus we can see that replies to a condescending post fall into these categories:

1. Uses strong language to reply
2. Points out something in the original post itself, e.g. 'tone', 'comment', 'assumption'

Replies to a condescending post also tend not to use the words marked with a '-' above, although I don't really know what the pattern is.

---

In addition, the best words for prediction from the **post** are:
- "condescending" (-) (if this word is present in the post, the post is much **less** likely to be condescending)
- "sell" (+)
- "lol" (+)
- "concern" (-)
- "tone" (-)
- "thank" (-)
- "speak" (-)

The most obvious thing about this is that if someone says the word condescending, they are probably trying not to be condescending, which kind of makes sense.

### Classification from only the post
From the above we can see that a model will likely predict the outcome using mostly the reply. I want to know if we can just look at the post for prediction.

Let's just try a logistic model on the posts only and see what happens. I know I already did something like this in the first notebook, but I didn't analyze the coefficients (which is really what we are going for here), and logistic regression also works better than the Naive Bayes classifier used before.

In [36]:
# make a pipeline again
post_only_steps = [("stem", StemOrLemmatizer(cols=["cleaned_post"])),
                   ("cvec", CountVectorizer(stop_words=ENGLISH_STOP_WORDS, min_df = 10, ngram_range=(1,2))),
                   ("dense", ToDense()),
                   ("logistic", LogisticRegression(penalty="elasticnet",
                                                   C=0.1,
                                                   solver="saga",
                                                   max_iter=10000))]
                    # ^ use the same parameters we found earlier

post_only_pipeline = Pipeline(post_only_steps)

In [37]:
# try a grid search
grid_params = {"logistic__l1_ratio" : [0.3, 0.4, 0.5]}

post_gridsearch = GridSearchCV(post_only_pipeline, grid_params, n_jobs = -1, cv = 4, verbose = 1)

In [38]:
post_gridsearch.fit(X_train, y_train)

Fitting 4 folds for each of 3 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of  12 | elapsed:  1.9min remaining:  9.4min
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:  2.1min finished


GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('stem',
                                        <__main__.StemOrLemmatizer object at 0x0000021649AB61C0>),
                                       ('cvec',
                                        CountVectorizer(min_df=10,
                                                        ngram_range=(1, 2),
                                                        stop_words=frozenset({'a',
                                                                              'about',
                                                                              'above',
                                                                              'across',
                                                                              'after',
                                                                              'afterwards',
                                                                              'again',
                                

See how well it does:

In [39]:
post_gridsearch.best_score_

0.5770736677181737

In [46]:
val_predictions = post_gridsearch.predict(X_test)

In [47]:
roc_auc_score(y_test, val_predictions)

0.5705663048607834

The accuracy (0.58) and ROC AUC (0.57) are really bad (as expected), barely above 0.5.

I'll just look at the coeffs anyway, since I already got to this step.

In [40]:
posts_only_best_model = post_gridsearch.best_estimator_

In [41]:
# get coefficients of logistic regression
logistic_coeffs = posts_only_best_model.named_steps["logistic"].coef_

# get the words used to vectorize
post_vocab = posts_only_best_model.named_steps["cvec"].get_feature_names()

# put it into a big DF
post_only_words_df = pd.DataFrame([post_vocab, logistic_coeffs[0]]).transpose()

# change the column names
post_only_words_df.rename({0 : "word", 1: "Logistic Coeff"}, axis = 1, inplace=True)

post_only_words_df["abs_coeff"] = np.abs(post_only_words_df["Logistic Coeff"])

In [43]:
post_only_words_df.sort_values(by = "abs_coeff", ascending=False).head(20)

|      | word | Logistic Coeff | abs_coeff |
|------|------|----------------|-----------|
| 2794 | short | -0.526347 | 0.526347 |
| 3084 | thank | -0.42569 | 0.42569 |
| 1855 | lol | 0.358458 | 0.358458 |
| 2750 | sell | 0.354544 | 0.354544 |
| 610 | concern | -0.346834 | 0.346834 |
| 2182 | outright | -0.339343 | 0.339343 |
| 2122 | obviou | 0.331344 | 0.331344 |
| 1878 | luck | 0.32594 | 0.32594 |
| 935 | downvot | 0.292981 | 0.292981 |
| 987 | elect | 0.292696 | 0.292696 |


It's a bit hard to see what the pattern is for this. I tried grouping the words by which category they were in:

Words that are more condescending:
- lol
- sell
- obvious
- luck
- downvote
- elect
- understand

Words that are less condescending:
- short
- thank
- concern
- outright
- serve

(Note that I only vectorized terms that appear in at least 10 posts, since `min_df = 10` is a document count, not a total occurrence count.)
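The difference between document frequency (what `min_df` filters on) and total occurrences can be seen in a tiny sketch. The mini-corpus and the threshold of 2 are made up for illustration; the notebook itself uses `min_df = 10`.

```python
from collections import Counter

# Hypothetical mini-corpus (illustrative only)
docs = [
    "lol lol lol lol",
    "thank you",
    "thank you very much",
]

# Total number of times each word occurs across all documents
total_occurrences = Counter(word for doc in docs for word in doc.split())
# Number of documents each word appears in (counted once per document)
document_frequency = Counter(word for doc in docs for word in set(doc.split()))

min_df = 2  # keep terms that appear in at least 2 documents
kept = sorted(w for w, df in document_frequency.items() if df >= min_df)

print(total_occurrences["lol"], document_frequency["lol"])  # 4 1
print(kept)  # ['thank', 'you']
```

Here "lol" occurs four times but only in one document, so a document-frequency threshold drops it even though its total count is the highest.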

I can't really tell if there is any pattern here and since the accuracy is low, I don't think it's worth the time to look at this more.

# Conclusion for this notebook <a id='conclusion'></a>
In this notebook we found a few important things:

- Predicting whether a post is condescending depends mostly on the reply and not on the post itself. This is partly because our model is quite simple.
    - To predict from the post alone, you probably need a more advanced model, since logistic regression on word counts performs barely better than chance.
- When someone is condescending, the replies will usually do one of these two things:
    - Use strong language (e.g. "don't be a prick")
    - Refer to something in the original post (e.g. "your post is ___")
- People who mention the word 'condescending' in a post are usually trying not to be condescending.

# What next?
Since replies to condescending posts seem to fall into 2 categories/topics, I want to try and model this change of topic. Please see the next notebook for this.