<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

### Contents:
- [Data Cleaning](#Data-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Machine Learning Models](#Machine-Learning-Models)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Import libraries

In [24]:
import pandas as pd
import string
import nltk
import re
import seaborn as sns
import matplotlib.pyplot as plt

from collections import Counter

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, accuracy_score, f1_score
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier


## Data Cleaning

In [25]:
# Set the display option to show all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [26]:
df = pd.read_csv("../data/df.csv")

In [27]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,Title,Post Text,ID,Score,Total Comments,Post URL,subreddit,class
1260,1260,I {DA} will meet a prospective romantic partner,"I have decided I will meet a real, prospective...",v3mnxx,12,6,https://www.reddit.com/r/AvoidantAttachment/co...,AvoidantAttachment,0
1451,96,Something I came across,,11jzpu0,165,16,https://i.redd.it/4t0e38t926ma1.jpg,AnxiousAttachment,1
3138,1783,Time machine,I saw someone write about wishing they had a t...,11o9tli,2,3,https://www.reddit.com/r/AnxiousAttachment/com...,AnxiousAttachment,1
1168,1168,{FA} Avoidant towards siblings,I originally posted this in the anxious sub bu...,way9w5,12,12,https://www.reddit.com/r/AvoidantAttachment/co...,AvoidantAttachment,0
2209,854,Just started dating again. Scared and excited,"He (33M) is secure, I am leaning secure. First...",qfdyyh,40,6,https://www.reddit.com/r/AnxiousAttachment/com...,AnxiousAttachment,1
2999,1644,How to deal with my love interest needing space,I’m anxious attached and she feels dismissive;...,11uz5w5,2,1,https://www.reddit.com/r/AnxiousAttachment/com...,AnxiousAttachment,1
1681,326,Awareness Is Nothing Without Action,I began delving into attachment theory 2 years...,nxstma,75,28,https://www.reddit.com/r/AnxiousAttachment/com...,AnxiousAttachment,1
2191,836,I don't even know what to make of this.,,z0916c,40,25,https://i.redd.it/1f9t87y2h61a1.png,AnxiousAttachment,1
2265,910,"Me: ""I just want someone to be close and hold ...",,u02gud,38,2,https://i.redd.it/fcr8dy2ekks81.jpg,AnxiousAttachment,1
2908,1553,I've been lovebombed and lead on so many times...,I physically can't take it when men flirt with...,11yt8ci,15,3,https://www.reddit.com/r/AnxiousAttachment/com...,AnxiousAttachment,1


In [28]:
# Data Cleaning
# Selecting both the title and post text as the text to be analysed.

df["text"] = df["Title"] + ' ' + df["Post Text"].fillna('')

In [29]:
# Class 1 makes up about 57% of the rows and class 0 about 43%: a mild imbalance rather than a severe one

df["class"].value_counts()

1    1812
0    1355
Name: class, dtype: int64
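
The class shares, and the accuracy a model would get by always predicting the larger class, can be worked out directly from the counts printed above. This is a minimal sketch with the two counts hard-coded so it runs on its own; the names `class_counts`, `class_share` and `majority_baseline` are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 1812, 0: 1355}

total = sum(class_counts.values())
class_share = {label: count / total for label, count in class_counts.items()}

# Accuracy of always predicting the most frequent class (majority baseline)
majority_baseline = max(class_share.values())

for label, share in class_share.items():
    print(f"class {label}: {share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model score reported later is easier to interpret against this majority-class figure.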

In [30]:
# Remove punctuation and standardise to lowercase

def remove_punct(text):
    # Keep a character only if it is not ASCII punctuation (string.punctuation)
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

df["text_clean"] = df["text"].apply(lambda x: remove_punct(x.lower()))
df[["text", "text_clean"]].sample(5)

Unnamed: 0,text,text_clean
2043,how to stop rage when i’m triggered/disappoint...,how to stop rage when i’m triggereddisappointe...
2713,How do you deal with sex being a trigger for y...,how do you deal with sex being a trigger for y...
2447,What type of therapy is best for helping to ca...,what type of therapy is best for helping to ca...
1207,{FA} How do you find balance between proximity...,fa how do you find balance between proximity a...
672,What happens while an Avoidant takes space? As...,what happens while an avoidant takes space ask...
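
The sample output still contains curly apostrophes (for example in `i’m`), because `string.punctuation` only lists ASCII characters. One possible way to also drop Unicode punctuation is to check each character's Unicode category. The helper name `remove_punct_unicode` is hypothetical, and this is a sketch of an alternative rather than the cleaning step the notebook actually uses.

```python
import string
import unicodedata

def remove_punct_unicode(text):
    """Drop ASCII punctuation plus any character in a Unicode punctuation category (P*)."""
    return "".join(
        ch for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )

sample = "how to stop rage when i’m triggered"
ascii_only = "".join(ch for ch in sample if ch not in string.punctuation)

print(ascii_only)                    # the curly apostrophe survives
print(remove_punct_unicode(sample))  # the curly apostrophe is removed
```

Emoji are not punctuation in Unicode terms, so they would pass through both versions unchanged.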


In [31]:
# Remove urls in the text

def remove_url(text):
    text_nourl = re.sub(r'\S*http\S*', '', text)
    return text_nourl

df["text_clean"] = df["text_clean"].apply(lambda x: remove_url(x))

df[["text", "text_clean"]].sample(10)

Unnamed: 0,text,text_clean
1299,"{da}? New here. Just need…something Ok, ramble...",da new here just need…something ok ramble aler...
1850,"The saddest part in all of this is, how can so...",the saddest part in all of this is how can som...
223,How avoidance releases dopamine {DA} {FA} such...,how avoidance releases dopamine da fa such a f...
2725,Feeling triggered I think ive been doing okay ...,feeling triggered i think ive been doing okay ...
2727,Does this mean I’m “disorganized”? I was prett...,does this mean i’m “disorganized” i was pretty...
2558,(FA) Do you ever feel like you’re stuck with y...,fa do you ever feel like you’re stuck with you...
236,Leaving people on read in justified avoidance ...,leaving people on read in justified avoidance ...
1211,{fa} feel trapped again. Need an advice Okay f...,fa feel trapped again need an advice okay firs...
2182,Taylor Swift anxiously attached? During therap...,taylor swift anxiously attached during therapy...
542,Flattened personality—just trying not to shut ...,flattened personality—just trying not to shut ...


In [32]:
# Remove words that contain digits
def remove_digit(text):
    text_nodigit = re.sub(r'\w*\d\w*', '', text)
    return text_nodigit

df["text_clean"] = df["text_clean"].apply(lambda x: remove_digit(x))

df[["text", "text_clean"]].sample(10)

Unnamed: 0,text,text_clean
1379,"Hell, is where we reside 🫠",hell is where we reside 🫠
843,I’m {fa} going to “break up” with my ex who is...,i’m fa going to “break up” with my ex who is a...
2670,FA / AP situationship advice and support wante...,fa ap situationship advice and support wanted...
241,{DA}{FA}{AP}{SA} How many anxious-avoidant dan...,dafaapsa how many anxiousavoidant dances did i...
2224,My anxiety leaves my body whenever I restart n...,my anxiety leaves my body whenever i restart n...
1405,The dark reality of being A Dismissive Avoidan...,the dark reality of being a dismissive avoidan...
478,{FA} My emotions feel like a monster I have to...,fa my emotions feel like a monster i have to h...
316,{fa} {da} {sa} What do you do to be consistent...,fa da sa what do you do to be consistent in yo...
231,{FA} After having a good time with people I al...,fa after having a good time with people i alwa...
1690,From now on I’m taking “I don’t know what I wa...,from now on i’m taking “i don’t know what i wa...


In [33]:
# Tokenize
def tokenize(text):
    # \W matches any character that is neither alphanumeric nor an underscore
    # The + treats a run of such characters (e.g. two or more spaces) as a single separator
    tokens = re.split(r'\W+', text)
    return tokens

df["text_token"] = df["text_clean"].apply(lambda x: tokenize(x)) 
df[["text", "text_token"]].sample(5)

Unnamed: 0,text,text_token
1907,Thanksgiving Wishing all AP’s here a very warm...,"[thanksgiving, wishing, all, ap, s, here, a, v..."
2669,Feels unfair I've tried really hard to underst...,"[feels, unfair, ive, tried, really, hard, to, ..."
1862,I did it :))))) goodbye DA! My bf was an angel...,"[i, did, it, goodbye, da, my, bf, was, an, ang..."
2520,Healing anxious attachment | By Anna Akana,"[healing, anxious, attachment, by, anna, akana, ]"
2685,"AA’s in long term, stable relationships- is th...","[aa, s, in, long, term, stable, relationships,..."
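
Several token lists above end with an empty string, which is what `re.split` returns when the text ends with a separator. A possible alternative, sketched here with a hypothetical helper `tokenize_nonempty`, is to collect runs of word characters with `re.findall`, which never produces empty tokens.

```python
import re

def tokenize_nonempty(text):
    # Return runs of word characters; unlike re.split, this never yields '' tokens
    return re.findall(r"\w+", text)

sample = "healing anxious attachment  by anna akana "
split_tokens = re.split(r"\W+", sample)
found_tokens = tokenize_nonempty(sample)

print(split_tokens)  # the last element is an empty string
print(found_tokens)
```

The empty token is harmless once the tokens are joined back into a string and vectorized, but removing it keeps the intermediate columns cleaner.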


In [34]:
# List of default stopwords (requires a one-time nltk.download('stopwords'))
stopword = nltk.corpus.stopwords.words('english')

# Remove stop words
def remove_stopwords(tokenized_list):
    # Keep a word only if it is not in the stopword list
    text = [word for word in tokenized_list if word not in stopword]
    return text

df["text_stop"] = df["text_token"].apply(lambda x: remove_stopwords(x))
df[["text", "text_stop"]].sample(5)

Unnamed: 0,text,text_stop
3002,Got dumped by an emotionally unavailable man a...,"[got, dumped, emotionally, unavailable, man, p..."
2955,Does anyone else have RSD? For those unaware R...,"[anyone, else, rsd, unaware, rsd, rejection, s..."
1105,Help! I feel like my relationship is on its la...,"[help, feel, like, relationship, last, legs, d..."
571,What is exactly that in a potential parter tri...,"[exactly, potential, parter, trigger, fight, f..."
241,{DA}{FA}{AP}{SA} How many anxious-avoidant dan...,"[dafaapsa, many, anxiousavoidant, dances, take..."


In [35]:
# Lemmatizer (requires a one-time nltk.download('wordnet'))
wn = nltk.WordNetLemmatizer()

def lemmatizing(tokenized_text):
    # Lemmatize each token and join the results back into a single space-separated string
    text = [wn.lemmatize(word) for word in tokenized_text]
    return ' '.join(text)

df["text_lemmatise"] = df["text_stop"].apply(lambda x: lemmatizing(x))
df[["text", "text_lemmatise"]].sample(5)

Unnamed: 0,text,text_lemmatise
412,{da} Discovered one of the reasons why I’m avo...,da discovered one reason avoidant mother casua...
1969,anyone else fall into slumps where they don't ...,anyone else fall slump dont anything day partn...
434,"{FA} I said THE words. I said THE words. Yes, ...",fa said word said word yes word three word wor...
2832,"I could use some support/advice I (M, 28) am t...",could use supportadvice truly sorry place ask ...
154,Rant about avoidant-bashing {fa} I hate seeing...,rant avoidantbashing fa hate seeing comment pe...


In [36]:
# Identify words that appear in both classes
# Separate the text into two lists based on class
class_0_text = df[df['class'] == 0]["text_lemmatise"].tolist()
class_1_text = df[df['class'] == 1]["text_lemmatise"].tolist()

# Function to find and count duplicate words between both classes
def find_duplicate_words_between_classes(df, column_name, class_column_name):
    class_0_text = df[df[class_column_name] == 0][column_name].tolist()
    class_1_text = df[df[class_column_name] == 1][column_name].tolist()
    
    all_words_class_0 = ' '.join(class_0_text).split()
    all_words_class_1 = ' '.join(class_1_text).split()
    
    common_words = set(all_words_class_0).intersection(all_words_class_1)
    
    word_counts_class_0 = Counter(all_words_class_0)
    word_counts_class_1 = Counter(all_words_class_1)
    
    top_common_words_class_0 = {word: word_counts_class_0[word] for word in common_words if word in word_counts_class_0}
    top_common_words_class_1 = {word: word_counts_class_1[word] for word in common_words if word in word_counts_class_1}
    
    return pd.DataFrame({'Duplicate Words': list(common_words), 
                         'Count (Class 0)': [top_common_words_class_0.get(word, 0) for word in common_words],
                         'Count (Class 1)': [top_common_words_class_1.get(word, 0) for word in common_words]})

# Get the DataFrame containing duplicate words and counts for both classes
duplicate_words_df = find_duplicate_words_between_classes(df, "text_lemmatise", "class")

# Calculate the difference in counts between Class 0 and Class 1 for each word
duplicate_words_df['Count Difference'] = (duplicate_words_df['Count (Class 0)'] - duplicate_words_df['Count (Class 1)']).abs()

# Sort the DataFrame by the difference in counts, smallest first (ascending order)
duplicate_words_df = duplicate_words_df.sort_values(by='Count Difference', ascending=True)

# Display the duplicate words and counts for both classes
print(duplicate_words_df)
print(duplicate_words_df.shape)

           Duplicate Words  Count (Class 0)  Count (Class 1)  Count Difference
4123                puzzle                4                4                 0
1151              cocktail                1                1                 0
1150    anxiouspreoccupied                5                5                 0
1149                savior                1                1                 0
2793             agitation                1                1                 0
2795                 trade                1                1                 0
4896               salvage                2                2                 0
2800               colored                2                2                 0
2801           intensified                1                1                 0
2806            adjustment                2                2                 0
2807             backwards                6                6                 0
1138          overpowering                1         

In [37]:
# Use the shared words whose counts in class 0 and class 1 differ by at most 10 (inclusive) as additional stopwords to further clean the corpus

additional_stopwords = duplicate_words_df[duplicate_words_df['Count Difference'].between(0, 10)]
stopword += additional_stopwords["Duplicate Words"].tolist()


In [38]:
# Show the additional stopwords, most frequent in class 0 first
additional_stopwords.sort_values(by='Count (Class 0)', ascending=False)

Unnamed: 0,Duplicate Words,Count (Class 0),Count (Class 1),Count Difference
2267,people,1146,1140,6
456,anyone,514,508,6
307,question,320,312,8
3921,anything,306,316,10
1640,post,305,295,10
2873,start,300,310,10
5014,maybe,278,285,7
6123,able,275,278,3
2595,many,250,252,2
2027,little,220,226,6
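
The selection logic behind these additional stopwords can be sketched on a toy corpus with only the standard library. The documents and the threshold `max_diff` below are hypothetical stand-ins; the notebook applies the same idea to the lemmatised posts with a threshold of 10.

```python
from collections import Counter

# Toy corpora standing in for the lemmatised text of each class (hypothetical data)
class_0_docs = ["avoidant people need space", "people avoid conflict"]
class_1_docs = ["anxious people need reassurance", "anxious people text often"]

counts_0 = Counter(" ".join(class_0_docs).split())
counts_1 = Counter(" ".join(class_1_docs).split())

max_diff = 0  # hypothetical threshold for this toy data; the notebook uses 10

# Words present in both classes whose counts are nearly equal carry little class signal
shared = set(counts_0) & set(counts_1)
extra_stopwords = sorted(w for w in shared if abs(counts_0[w] - counts_1[w]) <= max_diff)
print(extra_stopwords)
```

Because the threshold is an absolute count, rare words are far more likely to qualify than frequent ones, which is worth keeping in mind when reading the table above.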


In [39]:
df["text_final"] = df["text_token"].apply(lambda x: remove_stopwords(x))
df["text_final"] = df["text_final"].apply(lambda x: lemmatizing(x))

In [40]:
df[["text", "text_final"]].sample(5)

Unnamed: 0,text,text_final
1676,Finally getting over DA This morning I did not...,finally getting da morning check phone name ch...
261,Activation/Deactivation: Understanding and wor...,activationdeactivation understanding working f...
460,“{da}” how do i get over assuming ppl don’t lo...,da get love feel like worked much almost read...
1057,Over a break-up quickly.. normal or not? {FA} ...,breakup normal fa day ago partner broke week r...
2246,Invest as much energy as you invest on to them...,much energy sequel last someone showing much i...


In [41]:
df.head()

Unnamed: 0.1,Unnamed: 0,Title,Post Text,ID,Score,Total Comments,Post URL,subreddit,class,text,text_clean,text_token,text_stop,text_lemmatise,text_final
0,0,Seriously though {FA}{DA},,tqnp1u,511,19,https://i.redd.it/vv5etnapy7q81.jpg,AvoidantAttachment,0,Seriously though {FA}{DA},seriously though fada,"[seriously, though, fada, ]","[seriously, though, fada, ]",seriously though fada,seriously though fada
1,1,For all my favorite avoidants ❤️,,rpvbi1,457,2,https://i.redd.it/8yz268zr05881.jpg,AvoidantAttachment,0,For all my favorite avoidants ❤️,for all my favorite avoidants ❤️,"[for, all, my, favorite, avoidants, ]","[favorite, avoidants, ]",favorite avoidants,favorite avoidants
2,2,Anxious People on this subreddit: stop abandon...,If you’re anxious preoccupied or anxious leani...,plr9xd,439,65,https://www.reddit.com/r/AvoidantAttachment/co...,AvoidantAttachment,0,Anxious People on this subreddit: stop abandon...,anxious people on this subreddit stop abandoni...,"[anxious, people, on, this, subreddit, stop, a...","[anxious, people, subreddit, stop, abandoning,...",anxious people subreddit stop abandoning blami...,anxious stop blaming someone else anxious preo...
3,3,Same {FA},,syzhtu,349,6,https://i.redd.it/ta0rdmnlhgj81.jpg,AvoidantAttachment,0,Same {FA},same fa,"[same, fa, ]","[fa, ]",fa,fa
4,4,And that’s on self development,,r7y93d,329,6,https://i.redd.it/a9dh9u0oob381.jpg,AvoidantAttachment,0,And that’s on self development,and that’s on self development,"[and, that, s, on, self, development, ]","[self, development, ]",self development,self


In [42]:
df.shape

(3167, 15)

## Exploratory Data Analysis

#### EDA on Length and Word Count of post

In [None]:
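# Create a new column called post_length that contains the character length of each post
# (image-only posts have no text, so missing values are treated as empty strings)
df['post_length'] = df['Post Text'].fillna('').map(len)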

In [None]:
# Explore data on graphs
sns.displot(df['post_length'], kde=True)

In [None]:
# Create a new column called post_word_count that contains the number of words in each post
# (missing post text is treated as an empty string, i.e. zero words)
df['post_word_count'] = df['Post Text'].fillna('').map(lambda x: len(x.split()))

In [None]:
# Explore data on graphs
sns.displot(df['post_word_count'], kde=True)

#### EDA on Posts Content

In [None]:
# Show the five shortest posts based off of post_word_count
df.sort_values(by='post_word_count', ascending=True)[['Post Text']].head()

In [None]:
# Show the five longest posts based off of post_word_count
df.sort_values(by='post_word_count', ascending=False)[['Post Text']].head()

#### EDA on Post Score

In [None]:
# Filter the DataFrame where 'class' is equal to 0 (r/AvoidantAttachment)
filtered_df = df[df['class'] == 0]

# Sort the filtered DataFrame by 'Score' in ascending order and show the lowest-scoring posts
lowest_scores = filtered_df.sort_values(by='Score', ascending=True).head()

lowest_scores

In [None]:
# Show the highest-scoring posts

# Filter the DataFrame where 'class' is equal to 1 (r/AnxiousAttachment)
filtered_df = df[df['class'] == 1]

# Sort the filtered DataFrame by 'Score' in descending order and show the highest-scoring posts
highest_scores = filtered_df.sort_values(by='Score', ascending=False).head()

highest_scores

## Machine Learning Models

### Baseline Model 

### Count Vectorizer & Naive Bayes (Bernoulli) Pipeline 

In [43]:
# Define the features (X) and the target (y)
X = df['text_final']
y = df['class']

# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y,random_state=42)

In [44]:
# Create a pipeline for Count Vectorizer and Naive Bayes (Bernoulli)
pipe_cv_berNB = Pipeline([
    ('cvec',  CountVectorizer()),
    ('berNB', BernoulliNB())])
# Define hyperparameters for grid search
pipe_cv_berNB_params = {'cvec__max_features' : [500],  # 100
                        'cvec__ngram_range' : [(1,2)], #(1,1),(1,2),(2,2),(2,3),(3,3),(1,3)
                       'berNB__binarize': [0.0]}       # 0.1, 0.5

# Instantiate GridSearchCV
gs_cv_berNB = GridSearchCV(pipe_cv_berNB, pipe_cv_berNB_params, cv=5)

# Fit GridSearch to training data
gs_cv_berNB.fit(X_train, y_train)

# Best score
print (f'Best Score: {gs_cv_berNB.best_score_}')

# Best parameters
print (f'Best Parameters: {gs_cv_berNB.best_params_}')

# Training score
print (f'Train Score: {gs_cv_berNB.score(X_train, y_train)}')

# Test score
print (f'Test Score: {gs_cv_berNB.score(X_test, y_test)}')

# y predict
y_pred_gs_cv_berNB = gs_cv_berNB.predict(X_test)
print(classification_report(y_test, y_pred_gs_cv_berNB))

Best Score: 0.8298947368421052
Best Parameters: {'berNB__binarize': 0.0, 'cvec__max_features': 500, 'cvec__ngram_range': (1, 2)}
Train Score: 0.8530526315789474
Test Score: 0.7878787878787878
              precision    recall  f1-score   support

           0       0.79      0.69      0.74       339
           1       0.79      0.86      0.82       453

    accuracy                           0.79       792
   macro avg       0.79      0.78      0.78       792
weighted avg       0.79      0.79      0.79       792



### Count Vectorizer & Naive Bayes (Multinomial) Pipeline

In [45]:
# Create a pipeline for Count Vectorizer and Naive Bayes (Multinomial)
pipe_cv_multiNB = Pipeline([
    ('cvec',  CountVectorizer()), 
    ('multiNB', MultinomialNB())])

# Define hyperparameters for grid search
pipe_cv_multiNB_params = {'cvec__max_features' : [100,500],  # vocabulary sizes searched
                        'cvec__ngram_range' : [(1,1),(1,2),(2,2),(2,3),(3,3),(1,3)]}
                        

# Instantiate GridSearchCV    
gs_cv_multiNB = GridSearchCV(pipe_cv_multiNB, pipe_cv_multiNB_params, cv=5)

# Fit GridSearch to training data
gs_cv_multiNB.fit(X_train, y_train)

# Best score
print (f'Best Score: {gs_cv_multiNB.best_score_}')

# Best parameters
print (f'Best Parameters: {gs_cv_multiNB.best_params_}')

# Training score
print (f'Train Score: {gs_cv_multiNB.score(X_train, y_train)}')

# Test score
print (f'Test Score: {gs_cv_multiNB.score(X_test, y_test)}')

# y predict
y_pred_gs_cv_multiNB = gs_cv_multiNB.predict(X_test)
print(classification_report(y_test, y_pred_gs_cv_multiNB))

Best Score: 0.8463157894736841
Best Parameters: {'cvec__max_features': 500, 'cvec__ngram_range': (1, 2)}
Train Score: 0.8711578947368421
Test Score: 0.8181818181818182
              precision    recall  f1-score   support

           0       0.79      0.78      0.79       339
           1       0.84      0.85      0.84       453

    accuracy                           0.82       792
   macro avg       0.81      0.81      0.81       792
weighted avg       0.82      0.82      0.82       792



### TF-IDF Vectorizer & Logistic Regression Model

In [46]:
def tfidvectorize_split_smote_logistic(X, y, test_size=0.4, random_state=42):
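    # Step 1: Apply Tfidf Vectorizer
    # Note: the vectorizer is fitted on all of X before the split, so vocabulary and IDF
    # weights are learned from the test documents too; the test score may be slightly optimistic
    vectorizer = TfidfVectorizer(max_features=450, min_df=2, max_df=.9, ngram_range=(1,2))
    X_vectorized = vectorizer.fit_transform(X)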
    
    # Parameters tested
    # max_features: [500, 2_000, 3_000, 4_000, 5_000]
    # min_df: [3]
    # max_df: [.95]
    # ngram_range:[(1,1) (2,2)]

    
    # Step 2: Perform train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=test_size, random_state=random_state)
    
    
    # Step 3: Apply SMOTE to the training data
    smote = SMOTE(random_state=random_state)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
    
    
    # Step 4: Train a logistic regression model
    classifier = LogisticRegression(solver='liblinear')
    classifier.fit(X_train_resampled, y_train_resampled) 
    
    
    #Step 5: Retrieve the coefficient and feature names 
    coef = classifier.coef_[0]
    features = vectorizer.get_feature_names_out() 
    feature_importance = pd.DataFrame({'Features': features, 'Importance': coef})
    # Separate positive and negative coefficients
    positive_coeffs = feature_importance[feature_importance['Importance'] > 0]
    negative_coeffs = feature_importance[feature_importance['Importance'] < 0]
    # Take the 50 largest positive coefficients (features pointing most strongly to class 1)
    largest_positive_coeff = positive_coeffs.nlargest(50, 'Importance')
    # Take the 50 most negative coefficients (features pointing most strongly to class 0)
    smallest_negative_coeff = negative_coeffs.nsmallest(50, 'Importance')
    
    
    # Step 6: Evaluate on training and test data
    train_predictions = classifier.predict(X_train_resampled)
    test_predictions = classifier.predict(X_test)
    
    train_accuracy = accuracy_score(y_train_resampled, train_predictions)
    test_accuracy = accuracy_score(y_test, test_predictions)
    
    # Score model on training set
    print(f'Train Score: {train_accuracy}')
    
    # Score model on testing set
    print(f'Test Score: {test_accuracy}')
    
    # Making predictions
    print(classification_report(y_test, test_predictions))
    
    return train_accuracy, test_accuracy, smallest_negative_coeff, largest_positive_coeff


# Invoke Function 
tfidvectorize_split_smote_logistic(X,y)

Train Score: 0.8896457765667575
Test Score: 0.8366219415943172
              precision    recall  f1-score   support

           0       0.80      0.83      0.82       556
           1       0.87      0.84      0.85       711

    accuracy                           0.84      1267
   macro avg       0.83      0.84      0.83      1267
weighted avg       0.84      0.84      0.84      1267



(0.8896457765667575,
 0.8366219415943172,
                 Features  Importance
 133                   fa   -8.137783
 89                    da   -3.611140
 35             avoidance   -2.629223
 36              avoidant   -2.544965
 37   avoidant attachment   -1.995748
 38             avoidants   -1.976578
 95          deactivation   -1.949121
 319                  run   -1.617409
 140              feeling   -1.354977
 421              wanting   -1.346707
 295              problem   -1.338258
 34                 avoid   -1.328077
 90                   dad   -1.228363
 415                 user   -1.221671
 262                  mom   -1.202109
 417           vulnerable   -1.191878
 5               actually   -1.151732
 117              emotion   -1.134870
 215                  kid   -1.134443
 135               family   -1.133051
 283               parent   -1.105523
 433               wonder   -1.094806
 141                 felt   -1.079692
 106           dismissive   -1.078986
 292    
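
In the function above the TF-IDF vectorizer is fitted on every document before the train/test split. One way to keep the test documents out of the fitting step is to split the raw text first and place the vectorizer inside a `Pipeline`. The sketch below uses a tiny hypothetical corpus and leaves out SMOTE, which would need imbalanced-learn's own pipeline class to sit between the two steps; it illustrates the ordering only and is not a replacement for the notebook's results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy documents and labels (hypothetical, for illustration only)
docs = [
    "need space feel trapped", "avoid closeness need space",
    "shut down pull away", "feel trapped pull away",
    "fear abandonment need reassurance", "constant texting need reassurance",
    "fear rejection overthinking", "overthinking constant texting",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, before any vectorizing
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=42
)

# The vectorizer lives inside the pipeline, so fit() learns the vocabulary and
# IDF weights from the training documents only
pipe = Pipeline([
    ("tvec", TfidfVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression(solver="liblinear")),
])
pipe.fit(docs_train, y_train)

vocab = pipe.named_steps["tvec"].vocabulary_
print(f"vocabulary size learned from training data: {len(vocab)}")
print(f"test accuracy: {pipe.score(docs_test, y_test):.2f}")
```

The grid-search cells in this notebook already follow this pattern, since `GridSearchCV` refits the whole pipeline on each training fold.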

In [52]:
# Create a pipeline for TF-IDF and Logistic Regression
pipe_tvec_logreg = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('logreg', LogisticRegression())])

# Define hyperparameters for grid search
# max_features: cap on the vocabulary size (200, 300, 400, 500)
# min_df: minimum number of documents a token must appear in (2, 3)
# max_df: maximum share of documents a token may appear in (90%, 95%)
# ngram_range: unigrams, unigrams + bigrams, bigrams, bigrams + trigrams, trigrams, unigrams to trigrams
# Regularization strength (C) and penalty (l1: lasso, l2: ridge)
pipe_tvec_logreg_params = {
    'tvec__max_features': [200, 300, 400, 500],  #[20, 30, 40, 50, 2_000, 3_000, 4_000, 5_000]
    'tvec__min_df': [2, 3], #[2, 3]
    'tvec__max_df': [.9, .95], #[.9, .95]
    'tvec__ngram_range': [(1,1),(1,2),(2,2),(2,3),(3,3),(1,3)],   #[]
    'logreg__solver': ['liblinear'],  #['newton-cg', 'lbfgs', 'liblinear']
    'logreg__C': [1],  #[0.01, 0.1, 1.0, 10]
    'logreg__penalty': ['l2']}

# Instantiate GridSearchCV
gs_tvec_logreg = GridSearchCV(pipe_tvec_logreg, pipe_tvec_logreg_params, cv=5) 

# Fit GridSearch to training data
gs_tvec_logreg.fit(X_train, y_train)

# Best score
print (f'Best Score: {gs_tvec_logreg.best_score_}')

# Best parameters
print (f'Best Parameters: {gs_tvec_logreg.best_params_}')

# Train score
print (f'Train Score: {gs_tvec_logreg.score(X_train, y_train)}')

# Test score
print (f'Test Score: {gs_tvec_logreg.score(X_test, y_test)}')

# y predict
y_pred_gs_tvec_logreg = gs_tvec_logreg.predict(X_test)
print(classification_report(y_test, y_pred_gs_tvec_logreg))

              precision    recall  f1-score   support

           0       0.86      0.73      0.79       339
           1       0.82      0.91      0.86       453

    accuracy                           0.83       792
   macro avg       0.84      0.82      0.83       792
weighted avg       0.84      0.83      0.83       792



In [53]:
# Count Vectorizer, SMOTE & Random Forest

def countvectorize_split_smote_rf(X, y, test_size=0.4, random_state=30):
    # Step 1: Apply CountVectorizer
    vectorizer = CountVectorizer(max_features=25)
    X_vectorized = vectorizer.fit_transform(X)
    
    # Step 2: Perform train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=test_size, random_state=random_state)
    
    print("Class Breakdown (Before SMOTE) \n" + str(y_train.value_counts()) + "\n")
    
    # Step 3: Apply SMOTE to the training data
    smote = SMOTE(random_state=random_state)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
    
    print("Class Breakdown (After SMOTE) \n" +  str(y_train_resampled.value_counts()))
    
    # Step 4: Train a random forest classifier
    classifier = RandomForestClassifier(n_estimators=500)
    classifier.fit(X_train_resampled, y_train_resampled)
    
    # Step 5: Evaluate on training and test data
    train_predictions = classifier.predict(X_train_resampled)
    test_predictions = classifier.predict(X_test)
    
    train_accuracy = accuracy_score(y_train_resampled, train_predictions)
    test_accuracy = accuracy_score(y_test, test_predictions)
    
    print("\nTrain Acc:" + str(round(train_accuracy,2)) + " \nTest Acc:" + str(round(test_accuracy,2)))

In [54]:
# Count Vectorizer, SMOTE & Bagging Classifier

def countvectorize_split_smote_bagging(X, y, test_size=0.4, random_state=30):
    # Step 1: Apply CountVectorizer
    vectorizer = CountVectorizer(max_features=90)
    X_vectorized = vectorizer.fit_transform(X)
    
    # Step 2: Perform train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=test_size, random_state=random_state)
    
    print("Class Breakdown (Before SMOTE) \n" + str(y_train.value_counts()) + "\n")
    
    # Step 3: Apply SMOTE to the training data
    smote = SMOTE(random_state=random_state)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
    
    print("Class Breakdown (After SMOTE) \n" +  str(y_train_resampled.value_counts()))
    
    # Step 4: Train a bagging classifier
    classifier = BaggingClassifier(n_estimators=200, max_samples=0.2)
    classifier.fit(X_train_resampled, y_train_resampled)
    
    # Step 5: Evaluate on training and test data
    train_predictions = classifier.predict(X_train_resampled)
    test_predictions = classifier.predict(X_test)
    
    train_accuracy = accuracy_score(y_train_resampled, train_predictions)
    test_accuracy = accuracy_score(y_test, test_predictions)
    
    print("\nTrain Acc:" + str(round(train_accuracy,2)) + " \nTest Acc:" + str(round(test_accuracy,2)))

In [None]:
# insert the model results table and explain

## Conclusions and Recommendations