# **A model for classifying jokes as adult or clean**
In this notebook we explore training a model to classify jokes as adult or clean.


In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import scipy.sparse
from scipy.special import expit as sigmoid  # Import the sigmoid function
import joblib


# pd cosmetics
pd.set_option('display.max_colwidth', 3000)
pd.set_option('display.max_rows', 3000)
pd.set_option('display.max_columns', 3000)
pd.set_option('display.width', 1000)

df_jokes_slim = pd.read_csv('./data/reddit_jokes_slim_processed.csv')
df_jokes_slim.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37215 entries, 0 to 37214
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   index                37215 non-null  int64  
 1   thread_id            37215 non-null  object 
 2   thread_title         37215 non-null  object 
 3   thread_selftext      37215 non-null  object 
 4   thread_score         37215 non-null  float64
 5   thread_num_comments  37215 non-null  float64
 6   thread_created_utc   37215 non-null  object 
 7   thread_upvote_ratio  37215 non-null  float64
 8   thread_over_18       37215 non-null  bool   
 9   thread_created_pst   37215 non-null  object 
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 2.6+ MB


In [29]:
# thread_over_18 is already a boolean column, so it can be used directly as a mask
adult_jokes = df_jokes_slim[df_jokes_slim.thread_over_18]
clean_jokes = df_jokes_slim[~df_jokes_slim.thread_over_18]

print(f'number of clean jokes in dataset = {len(clean_jokes)}')
print(f'number of adult jokes in dataset = {len(adult_jokes)}')

number of clean jokes in dataset = 31576
number of adult jokes in dataset = 5639
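
To put these counts in perspective, the short sketch below works out the adult share and the accuracy of a trivial "always clean" baseline. It uses only the two numbers printed above; the variable names are illustrative.

```python
# Counts copied from the output of the cell above.
n_clean = 31576
n_adult = 5639
total = n_clean + n_adult  # 37215, matching the RangeIndex reported by df.info()

adult_share = n_adult / total
imbalance_ratio = n_clean / n_adult

print(f"total jokes   = {total}")
print(f"adult share   = {adult_share:.1%}")
print(f"clean : adult = {imbalance_ratio:.1f} : 1")

# A classifier that always answers "clean" would already reach this accuracy
# on the original data, so plain accuracy is a weak yardstick here.
majority_baseline = n_clean / total
print(f"always-clean accuracy = {majority_baseline:.1%}")
```

Because the baseline is this high, per-class precision and recall are more informative than overall accuracy for this data set.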


## Logistic regression
The data set is imbalanced: there are far more clean jokes (31576) than adult jokes (5639). We use the Synthetic Minority Over-sampling Technique (SMOTE) to balance the two classes.
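
As a rough illustration of the idea behind SMOTE, the sketch below creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy version written for this explanation, not the `imblearn` implementation; the function name `smote_like_samples` and the toy points are made up for the example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                         # pick a minority sample
        neighbour = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                             # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_samples(X_min, n_new=6, k=2)
print(new_rows.shape)
```

Every synthetic row lies on a line segment between two real minority rows, so the new points stay inside the region the minority class already occupies.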

In [30]:

# Load the dataset
jokes_df = pd.read_csv('./data/reddit_jokes_slim_processed.csv')

# Combine the title and selftext into one column
jokes_df['combined_text'] = jokes_df['thread_title'] + ' ' + jokes_df['thread_selftext']

# Text Encoding with TF-IDF
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(jokes_df['combined_text'])
joblib.dump(tfidf, 'models/tfidf_vectorizer.pkl')  # Save the fitted vectorizer



# Combine the matrices
#X = scipy.sparse.hstack([tfidf_matrix_title, tfidf_matrix_text])

# Target variable
y = jokes_df['thread_over_18']

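# Handling Class Imbalance with SMOTE
# Note: resampling is done before the train/test split, so the test set
# created below also contains synthetic adult samples.
smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)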

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)

# Standardize features
#scaler = StandardScaler(with_mean=False)
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)

# Model Building
model = LogisticRegression(max_iter=2000, verbose=1)
model.fit(X_train, y_train)

# Model Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

joblib.dump(model, 'models/jokes_adult_clean_classifier_logreg.pkl')


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =        39362     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  3.50185D+04    |proj g|=  3.04690D+02


 This problem is unconstrained.



At iterate   50    f=  1.99301D+04    |proj g|=  4.96962D+01

At iterate  100    f=  1.98036D+04    |proj g|=  9.46964D+00

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
39362    143    170      1     0     0   5.396D-02   1.980D+04
  F =   19802.654313255058     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
              precision    recall  f1-score   support

       False       0.87      0.87      0.87      6250
        True       0.88      0.88      0.88      6381

    accuracy                           0.88     12631
   macro avg       0.88      0.88      0.88     12631
weighted avg       0.88    

['models/jokes_adult_clean_classifier_logreg.pkl']

In [31]:

import re
#tfidf_vectorizer = tfidf

# Example jokes
jokes = [
    "Why don't scientists trust atoms? Because they make up everything!",
    "What do you call a fake noodle? An impasta!",
    "How do you organize a space party? You planet!",
    "What's an astronaut's favorite part of the computer? The space bar.",
    "Why did the bicycle fall over? It was two-tired.",
    "What do you call cheese that isn't yours? Nacho cheese.",
    "Why couldn't the bicycle stand up by itself? It was two-tired.",
    "How does a penguin build its house? Igloos it together.",
    "Why don’t skeletons fight each other? They don’t have the guts.",
    "What did the grape do when he got stepped on? He let out a little wine.",
    "I told my wife she should embrace her mistakes. She gave me a hug.",
    "Why don't some couples go to the gym? Because some relationships don't work out!",
    "I'm reading a book on the history of glue. Can't put it down.",
    "I told my suitcases there will be no vacation this year. Now I'm dealing with emotional baggage.",
    "It's inappropriate to make a 'dad joke' if you're not a dad. It's a faux pa.",
    "I used to play piano by ear, but now I use my hands.",
    "What did the toaster say to the slice of bread? 'I want you inside me.'",
    "'Give it to me! Give it to me!' she yelled. 'I'm so wet, give it to me now!' She could scream all she wanted, but I was keeping the umbrella.",
    "Parallel lines have so much in common. It’s a shame they’ll never meet.",
    "My wife told me to take the spider out instead of killing it. We went and had some drinks. Cool guy, wants to be a web developer.",
]

# Preprocessing function (adjust it to match how the training data was preprocessed)
def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse whitespace left by the removals above
    return text

# Preprocess jokes
processed_jokes = [preprocess_text(joke) for joke in jokes]

# Vectorize jokes
tfidf_vectorizer = joblib.load('models/tfidf_vectorizer.pkl')
jokes_vectorized = tfidf_vectorizer.transform(processed_jokes)

model = joblib.load('models/jokes_adult_clean_classifier_logreg.pkl')
# Prediction
predictions = model.predict(jokes_vectorized)

# Output results
for joke, pred in zip(jokes, predictions):
    print(f"Joke: {joke}\nClassified as: {'Adult' if pred else 'Clean'}\n")

Joke: Why don't scientists trust atoms? Because they make up everything!
Classified as: Clean

Joke: What do you call a fake noodle? An impasta!
Classified as: Clean

Joke: How do you organize a space party? You planet!
Classified as: Clean

Joke: What's an astronaut's favorite part of the computer? The space bar.
Classified as: Clean

Joke: Why did the bicycle fall over? It was two-tired.
Classified as: Clean

Joke: What do you call cheese that isn't yours? Nacho cheese.
Classified as: Clean

Joke: Why couldn't the bicycle stand up by itself? It was two-tired.
Classified as: Clean

Joke: How does a penguin build its house? Igloos it together.
Classified as: Clean

Joke: Why don’t skeletons fight each other? They don’t have the guts.
Classified as: Clean

Joke: What did the grape do when he got stepped on? He let out a little wine.
Classified as: Clean

Joke: I told my wife she should embrace her mistakes. She gave me a hug.
Classified as: Clean

Joke: Why don't some couples go to the 