PURPOSE OF THIS NOTEBOOK:  
In this notebook, we turn raw, human-readable information into numerical representations that a model can learn from without leaking the target.  
_Leaking the target: the model accidentally gets hints about the answer during training._  
- This happens when the training input contains information that would not be available at prediction time.

_TEXT COLUMNS :_  
- The model does not distinguish between the sections of a posting; to it they are all just words. So we concatenate the text columns into a single field that holds everything the employer wrote.

In [41]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv("../data/processed/cleaned_job_postings.csv")
df.shape
df.columns

In [None]:
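# Flag postings whose location mentions "remote"; na=False treats a missing location as not remote
df["is_remote"] = df["location"].str.lower().str.contains("remote", na=False).astype(int)
# Drop columns that are not used as features (location is now summarized by is_remote)
df = df.drop(columns=["salary_range", "department", "location"])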
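As a small illustration of the derived flag, the sketch below applies the same string test to a toy series. The location values are made up for the example, and the `na=False` argument is an assumption about how missing locations should be treated (as not remote).

```python
import numpy as np
import pandas as pd

# Hypothetical location values, including one missing entry.
locations = pd.Series(["US, NY, New York", "Remote", np.nan, "GB, , REMOTE"])

# Case-insensitive match on "remote"; missing values count as not remote.
is_remote = locations.str.lower().str.contains("remote", na=False).astype(int)

print(is_remote.tolist())
```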

In [None]:
text_columns = [
    "title", 
    "company_profile", 
    "description",
    "requirements",
    "benefits"
    ]
binary_columns = [
    "telecommuting",
    "has_company_logo",
    "has_questions",
]

categorical_columns = [
    "is_remote", 
    "employment_type",
    "required_experience",
    "required_education",
    "industry",
    "function",
]

target_column="fraudulent"

In [None]:
# Force text columns into strings; fill any missing values first so they do not become the literal word "nan"
for col in text_columns:
    df[col] = df[col].fillna("").astype(str)

In [None]:
df["full_text"]=df[text_columns].agg(" ".join,axis=1)
# Joins text row wise, inserting spaces between sections
df=df.drop(columns=text_columns)
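The row-wise join can be checked on a toy frame. This is a minimal sketch with invented postings and a shortened column list (`toy_text_columns` is hypothetical); it also assumes missing sections should contribute no text at all.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the second posting has no benefits text.
toy = pd.DataFrame({
    "title": ["Data Analyst", "Office Clerk"],
    "description": ["Build weekly reports", "Answer phones"],
    "benefits": ["Health insurance", np.nan],
})
toy_text_columns = ["title", "description", "benefits"]

# Replace missing values with empty strings, then join each row with spaces.
cleaned = toy[toy_text_columns].fillna("").astype(str)
toy["full_text"] = cleaned.agg(" ".join, axis=1).str.strip()

print(toy["full_text"].tolist())
```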

In [None]:
x=df.drop(columns=[target_column])
y=df[target_column]


In [None]:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
# stratify=y preserves the original class distribution in both splits.
# test_size=0.2 holds out 20% of the rows for testing and keeps 80% for training.
# x is the feature set and y is the target variable.
# X_train: features used to train the model
# y_train: correct answers for the training data
# X_test: features used to evaluate the model's performance
# y_test: correct answers for the test data
y_train.value_counts(normalize=True)
y_test.value_counts(normalize=True)

fraudulent
0    0.951622
1    0.048378
Name: proportion, dtype: float64
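To see what `stratify` does in isolation, here is a sketch on synthetic labels (5% positives, chosen only for illustration). With stratification, both splits should keep the same positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 50 positives (5%).
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positives in each split.
print(y_tr.mean(), y_te.mean())
```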

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Convert text into numerical features using TF-IDF vectorization
tfidf = TfidfVectorizer(max_features=5000, stop_words="english", min_df=5)

# stop_words="english" removes common words that carry little meaning (e.g., "the", "is", "and")
# min_df=5 keeps only terms that appear in at least 5 documents
# max_features=5000 keeps the 5000 terms with the highest total count across the corpus (ranked by term frequency, not by TF-IDF score)
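The effect of these parameters is easier to see on a tiny corpus. The three documents below are invented, and the thresholds (`min_df=2`, `max_features=1`) are scaled down so the filtering is visible; this is a sketch, not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus.
docs = [
    "earn money fast from the comfort of home",
    "earn money with no experience in the field",
    "money transfer agent needed for the data platform",
]

# Stop words are dropped, then terms seen in fewer than 2 documents are dropped.
vec_min_df = TfidfVectorizer(stop_words="english", min_df=2).fit(docs)
print(sorted(vec_min_df.vocabulary_))

# max_features keeps the terms with the highest total count across the corpus.
vec_top = TfidfVectorizer(stop_words="english", max_features=1).fit(docs)
print(sorted(vec_top.vocabulary_))
```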

In [None]:
X_train_text = tfidf.fit_transform(X_train["full_text"])
X_test_text=tfidf.transform(X_test["full_text"])
"""
fit_transform learns the vocabulary and IDF weights from the training data and transforms it into a matrix of TF-IDF features.
transform applies the same transformation to the test data using the vocabulary learned from the training data, so nothing is learned from the test set.
"""
X_train_text.shape

(14304, 5000)
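A short sketch of why fitting on the training split alone avoids leakage: words that appear only in the test documents never enter the vocabulary. The documents here are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; "blockchain" appears only in the test document.
train_docs = ["work from home and earn money", "software engineer wanted"]
test_docs = ["blockchain engineer wanted"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

print(train_matrix.shape, test_matrix.shape)
print("blockchain" in vec.vocabulary_)
```

Both matrices have the same number of columns, and the unseen word simply contributes nothing to the test row.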

In [None]:
X_train_structured=X_train.drop(columns=["full_text"])
X_test_structured=X_test.drop(columns=["full_text"])

In [None]:
for col in categorical_columns:
    # nunique is a method; without the parentheses this prints the bound method instead of the count
    print(col, ":", df[col].nunique())

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
# handle_unknown="ignore": a category that appears in the test data but was not seen during training is encoded as all zeros instead of raising an error.
# sparse_output=False: the output is a dense array instead of a sparse matrix, which can be easier to inspect when the number of categories is small.

In [None]:
X_train_categorical = encoder.fit_transform(X_train_structured[categorical_columns])
# Learns all category levels from the training set and one-hot encodes its categorical features.
X_test_categorical = encoder.transform(X_test_structured[categorical_columns])
# Transforms the test categorical data using the levels learned from training.
X_train_binary = X_train_structured[binary_columns].values
X_test_binary = X_test_structured[binary_columns].values
# Binary columns are already 0/1, so they are passed through unchanged.

In [None]:
from scipy.sparse import hstack
X_train_final=hstack([X_train_text,X_train_categorical,X_train_binary])
X_test_final=hstack([X_test_text,X_test_categorical,X_test_binary])
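`scipy.sparse.hstack` accepts a mix of sparse and dense blocks as long as the row counts match, and returns a sparse result. The sketch below uses tiny made-up blocks; the `.tocsr()` call is an optional assumption, useful when row slicing is needed later.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse

# Tiny stand-ins for the three feature blocks (values are made up).
text_block = csr_matrix(np.array([[0.0, 0.7], [0.5, 0.0]]))  # sparse TF-IDF
onehot_block = np.array([[1.0, 0.0], [0.0, 1.0]])            # dense one-hot
binary_block = np.array([[1], [0]])                          # dense 0/1 flags

# Columns are placed side by side; the result stays sparse.
combined = hstack([text_block, onehot_block, binary_block]).tocsr()

print(issparse(combined), combined.shape)
```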

In [None]:
import joblib
# Save TF-IDF vectorizer
joblib.dump(tfidf, "../models/tfidf_vectorizer.pkl")

# Save encoder
joblib.dump(encoder, "../models/onehot_encoder.pkl")

['../models/onehot_encoder.pkl']

In [None]:
X_test_final.shape

(3576, 5201)

In [None]:
from scipy.sparse import save_npz
import numpy as np

In [None]:
# Save sparse feature matrices
save_npz("../data/processed/X_train_final.npz", X_train_final)
save_npz("../data/processed/X_test_final.npz", X_test_final)

# Save target arrays
np.save("../data/processed/y_train.npy", y_train.values)
np.save("../data/processed/y_test.npy", y_test.values)
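For completeness, a sketch of the round trip a later notebook might perform with `load_npz` and `np.load`. It writes to a temporary directory with hypothetical file names (`X_demo.npz`, `y_demo.npy`) rather than the paths used above.

```python
import tempfile
from pathlib import Path

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Small made-up matrix and labels, standing in for the real artifacts.
X_demo = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
y_demo = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = str(Path(tmp) / "X_demo.npz")  # hypothetical file name
    labels_path = str(Path(tmp) / "y_demo.npy")  # hypothetical file name

    save_npz(matrix_path, X_demo)
    np.save(labels_path, y_demo)

    # Reload and keep the results in memory.
    X_loaded = load_npz(matrix_path)
    y_loaded = np.load(labels_path)

print(X_loaded.shape, y_loaded.shape)
```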