# Model 1 - Logistic Regression
**Logistic regression** is a foundational statistical model used primarily for binary classification tasks, where it predicts the probability of an event occurring based on input variables. Despite its name, it is a classification model: it computes a linear combination of the input variables and passes it through the logistic (sigmoid) function, which maps the result to a probability between 0 and 1. This makes it straightforward to interpret, as the predicted probability directly indicates the likelihood of an instance belonging to a specific class. Logistic regression is computationally efficient, making it suitable for large datasets, and, unlike classifiers such as Gaussian naive Bayes, it does not assume that the input variables follow a particular distribution. It is commonly applied in scenarios such as predicting customer churn, identifying spam emails, or detecting fraudulent transactions. While its linear decision boundary can be limiting when relationships are complex, its simplicity, interpretability, and robust performance as a baseline model make it a standard choice in both statistical analysis and machine learning applications.
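
To make the sigmoid mapping concrete, here is a minimal sketch of how a single prediction is formed. The weights, bias, and feature values are hypothetical numbers chosen for illustration; they are not taken from the model trained later in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    # Branch on the sign of z so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Probability of class 1: sigmoid of the linear score w . x + b."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical parameters and one hypothetical instance
weights = [1.5, -2.0]
bias = 0.25
p = predict_proba(weights, bias, [1.0, 0.5])  # linear score z = 0.75
label = int(p >= 0.5)                         # default decision threshold of 0.5

print(f"probability={p:.3f}, predicted class={label}")  # p is about 0.679, class 1
```

The decision boundary is the set of points where the linear score equals zero, which is why the model is described as linear even though its output is a probability.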

## 1. Data Preprocessing
Data preprocessing transforms raw data into a clean and usable format by handling missing values, outliers, and ensuring consistent data scales through normalization or standardization. It also includes feature extraction and selection to enhance dataset quality. This step is essential for efficient and accurate data analysis or machine learning model performance.

## Importing Libraries

In [26]:
# importing all the necessary libraries
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

## Loading Dataset

In [49]:
# Reading and displaying the first 10 rows of the CSV file
df=pd.read_csv('fake_job_postings.csv')
df.head(10)

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
5,6,Accounting Clerk,"US, MD,",,,,Job OverviewApex is an environmental consultin...,,,0,0,0,,,,,,0
6,7,Head of Content (m/f),"DE, BE, Berlin",ANDROIDPIT,20000-28000,"Founded in 2009, the Fonpit AG rose with its i...",Your Responsibilities: Manage the English-spea...,Your Know-How: ...,Your Benefits: Being part of a fast-growing co...,0,1,1,Full-time,Mid-Senior level,Master's Degree,Online Media,Management,0
7,8,Lead Guest Service Specialist,"US, CA, San Francisco",,,Airenvy’s mission is to provide lucrative yet ...,Who is Airenvy?Hey there! We are seasoned entr...,"Experience with CRM software, live chat, and p...",Competitive Pay. You'll be able to eat steak e...,0,1,1,,,,,,0
8,9,HP BSM SME,"US, FL, Pensacola",,,Solutions3 is a woman-owned small business who...,Implementation/Configuration/Testing/Training ...,MUST BE A US CITIZEN.An active TS/SCI clearanc...,,0,1,1,Full-time,Associate,,Information Technology and Services,,0
9,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"Novitex Enterprise Solutions, formerly Pitney ...",The Customer Service Associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,,0,1,0,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,0


In [51]:
# Count the null values in each column of the dataset
df.isnull().sum()

job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2696
benefits                7212
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64

In [53]:
# Fill missing values in text columns with empty strings
text_columns = ['description', 'requirements', 'company_profile', 'title', 'location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
df[text_columns] = df[text_columns].fillna('')

In [55]:
# Combine text columns into a single column
df['combined_text'] = df[text_columns].apply(lambda x: ' '.join(x), axis=1)

In [57]:
df['combined_text'].head()

0    Food52, a fast-growing, James Beard Award-winn...
1    Organised - Focused - Vibrant - Awesome!Do you...
2    Our client, located in Houston, is actively se...
3    THE COMPANY: ESRI – Environmental Systems Rese...
4    JOB TITLE: Itemization Review ManagerLOCATION:...
Name: combined_text, dtype: object

## Convert to lower case

In [59]:
df['combined_text'] = df['combined_text'].str.lower()

## Remove links

In [60]:
# Remove URLs: anything starting with "http" (which also covers "https") or "www", up to the next whitespace
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))

## Remove newlines (\n)

In [62]:
df['combined_text'] = df['combined_text'].str.replace('\n', ' ')

## Remove words containing numbers

In [65]:
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))

## Remove extra spaces

In [66]:
df['combined_text'] = df['combined_text'].str.strip()
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(' +', ' ', x))

## Remove special characters

In [67]:
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
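
The cleaning steps above can also be collected into one function, which makes them easier to test on a single string. This is a sketch rather than the notebook's exact code: the function name `clean_text` and the sample sentence are made up for illustration, and whitespace is collapsed last (the notebook collapses it before removing special characters) so that no double spaces are left behind.

```python
import re

URL_RE = re.compile(r'http\S+|www\S+')        # links
WORD_WITH_DIGIT_RE = re.compile(r'\w*\d\w*')  # any word containing a digit
NON_LETTER_RE = re.compile(r'[^a-zA-Z\s]')    # punctuation and other symbols
WHITESPACE_RE = re.compile(r'\s+')            # runs of spaces, tabs, newlines

def clean_text(text: str) -> str:
    """Apply the notebook's cleaning steps to one string."""
    text = text.lower()
    text = URL_RE.sub('', text)
    text = text.replace('\n', ' ')
    text = WORD_WITH_DIGIT_RE.sub('', text)
    text = NON_LETTER_RE.sub('', text)
    # Collapse whitespace at the end so earlier removals leave no gaps
    return WHITESPACE_RE.sub(' ', text).strip()

sample = "Apply NOW at https://example.com/jobs!\nEarn $5000 per week, call 555abc."
print(clean_text(sample))  # apply now at earn per week call
```

With pandas, such a function could be applied in one pass with `df['combined_text'].apply(clean_text)`.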

## Removal of stop words

In [68]:
!pip install nltk
import nltk
nltk.download('stopwords')

# stopwords was imported from nltk.corpus at the top of the notebook.
# A set makes each membership test constant-time instead of scanning a list.
stop = set(stopwords.words('english'))
df['combined_text'] = df['combined_text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Stemming

In [69]:
stemmer = PorterStemmer()
df['combined_text'] = df['combined_text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

## Lemmatization

In [70]:
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
df['combined_text'] = df['combined_text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 2. Feature Engineering
Feature engineering involves creating or modifying features to improve machine learning models. This includes selecting relevant features, transforming data, and creating new features based on insights. Techniques include encoding categorical variables, scaling numerical features, and creating interaction terms. Effective feature engineering enhances model accuracy and robustness by capturing underlying data patterns.

In [74]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [75]:
# Vectorize the text data with limited features
vectorizer = TfidfVectorizer(max_features=1000)
X_text = vectorizer.fit_transform(df['combined_text'])
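
For intuition about what the vectorizer computes, here is a small pure-Python sketch of TF-IDF weighting. It follows scikit-learn's default formula (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row), but it tokenises with a plain `split()` and has no `max_features` cap, so it is an illustration and not a drop-in replacement. The three example documents are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard empty docs
        rows.append([v / norm for v in row])
    return vocab, rows

docs = ["earn money fast", "earn money home", "software engineer role"]
vocab, rows = tfidf(docs)
first = dict(zip(vocab, rows[0]))
# "fast" occurs in one document, "earn" in two, so "fast" gets the larger weight
print(round(first["fast"], 3), round(first["earn"], 3))
```

Terms that appear in many postings receive lower weights than rare terms, which is the property that makes TF-IDF useful for separating distinctive vocabulary from boilerplate.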

In [77]:
# List of non-text columns to be used as features
non_text_columns = ['telecommuting', 'has_company_logo', 'has_questions', 'location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']

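# Fill any remaining missing values with a placeholder.
# Note: the categorical columns were already filled with '' in the preprocessing step
# and the three binary flags have no nulls, so this line acts only as a safeguard.
df[non_text_columns] = df[non_text_columns].fillna('missing')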

**One-hot encoding**

One-hot encoding is a technique used to convert categorical variables into a format suitable for machine learning algorithms. It creates new binary columns, where each column represents a unique category, and assigns a value of 1 to the column corresponding to the category of each data point. One-hot encoding avoids assumptions about the ordering or hierarchy of categories, but can increase the dimensionality of the dataset.
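
A minimal, library-free sketch of the idea: each distinct category gets its own column, and every row has exactly one 1. The helper name `one_hot_encode` and the sample values are illustrative. The empty string stands in for a missing value, as in this notebook's preprocessing, and therefore becomes a category of its own.

```python
def one_hot_encode(values):
    """Return (categories, rows): one binary column per distinct category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows

employment_type = ["Full-time", "Part-time", "", "Full-time"]
categories, encoded = one_hot_encode(employment_type)

print(categories)  # ['', 'Full-time', 'Part-time']
for row in encoded:
    print(row)
```

The number of output columns equals the number of distinct categories, so high-cardinality columns such as `location` can add thousands of columns. This is why the encoded output is usually kept as a sparse matrix.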

In [78]:
# OneHotEncode the categorical non-text columns
categorical_columns = ['location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
encoder = OneHotEncoder()
X_non_text_encoded = encoder.fit_transform(df[categorical_columns])

In [79]:
# Scale non-categorical non-text columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_non_text_scaled = scaler.fit_transform(df[['telecommuting', 'has_company_logo', 'has_questions']])
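
As a rough illustration of what standardization does to one column, the sketch below subtracts the mean and divides by the population standard deviation, which is the convention `StandardScaler` uses. The helper name `standardize` and the four sample values are made up.

```python
import math

def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / n  # population variance
    std = math.sqrt(variance) or 1.0  # constant column: avoid dividing by zero
    return [(x - mean) / std for x in column]

has_company_logo = [1, 1, 0, 1]  # hypothetical values of a binary flag
scaled = standardize(has_company_logo)
print([round(v, 3) for v in scaled])  # [0.577, 0.577, -1.732, 0.577]
```

For 0/1 flags the scaled values still take only two distinct values, so the step changes the scale of the feature rather than its information content.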

In [80]:
# Import hstack from scipy.sparse
from scipy.sparse import hstack

# Combine non-categorical and categorical features
X_non_text = hstack([X_non_text_encoded, X_non_text_scaled])

In [81]:
# Combine the text and non-text features
X = hstack([X_text, X_non_text])

In [82]:
# Target variable
y = df['fraudulent']

In [83]:
# Hold out 30% of the data, then divide that hold-out into test and validation sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.33, random_state=42) # validation: 0.33 * 0.3 ≈ 0.10 of all rows; test: ≈ 0.20
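
The arithmetic behind the two nested splits can be checked with a few lines. The helper name is hypothetical. The one library fact it relies on is that `train_test_split` returns the `test_size` portion as the second array of each pair, so `X_val` receives 33% of the hold-out.

```python
def nested_split_fractions(first_test_size, second_test_size):
    """Fractions of the full dataset that end up in each of the three sets."""
    holdout = first_test_size
    return {
        "train": 1.0 - first_test_size,
        "test": holdout * (1.0 - second_test_size),  # first array of split 2
        "validation": holdout * second_test_size,    # second array of split 2
    }

fractions = nested_split_fractions(0.3, 0.33)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
# train: 0.700
# test: 0.201
# validation: 0.099
```

This is consistent with the classification report below, whose total support of 1771 rows corresponds to the validation share of roughly 10%.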

## 3. Model Building
Model building with logistic regression for binary classification starts from the prepared data: missing values handled, categorical variables encoded, and features scaled. The data is then split into training, test, and validation sets. A logistic regression model is trained on the training set, and its performance is evaluated on held-out data (the validation set in the cell below) using metrics such as accuracy, precision, and recall. Hyperparameter tuning can be performed to optimize the model. This process helps create a robust predictive model that can effectively classify binary outcomes.
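
To show what "training" means here, the following sketch fits a one-feature logistic regression with plain batch gradient descent on the log-loss. It is an illustration under simplifying assumptions, not what scikit-learn does internally: `LogisticRegression` defaults to the L-BFGS solver with L2 regularization, whereas this version has no regularization. The toy data, learning rate, and epoch count are arbitrary choices.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log-loss w.r.t. the score is (probability - label)
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data: the label flips from 0 to 1 between x = 1 and x = 2
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
w, b = fit_logistic(X_toy, y_toy)
preds = [int(sigmoid(b + w[0] * x[0]) >= 0.5) for x in X_toy]
print(preds)  # [0, 0, 1, 1]
```

The learned boundary sits where `w * x + b = 0`, which for this toy data falls between 1 and 2.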

In [87]:
# Import the necessary class
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)

print("Logistic Regression Accuracy:", accuracy_score(y_val, y_pred_lr))
print(classification_report(y_val, y_pred_lr))

Logistic Regression Accuracy: 0.9757199322416714
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1672
           1       0.94      0.61      0.74        99

    accuracy                           0.98      1771
   macro avg       0.96      0.80      0.86      1771
weighted avg       0.97      0.98      0.97      1771




## Detailed Output Analysis of Logistic Regression Model

The logistic regression model achieved an overall accuracy of 0.9757199322416714, or approximately 97.57%, on the validation set (`X_val`, 1771 instances). This means the model correctly classified 97.57% of the instances in the validation data.

The classification report provides more detailed performance metrics:

**Class 0 Performance**
- **Precision**: 0.98
- **Recall**: 1.00
- **F1-score**: 0.99
- **Support**: 1672

For class 0, the model has very high precision (98%), meaning 98% of the instances predicted as class 0 are actually class 0. The recall rounds to 1.00, indicating that the model correctly identifies nearly all of the actual class 0 instances (not literally all, since the class 1 precision of 0.94 implies that a few class 0 instances were predicted as class 1). The F1-score, which is the harmonic mean of precision and recall, is 0.99, reflecting the model's excellent performance for class 0.

**Class 1 Performance**
- **Precision**: 0.94
- **Recall**: 0.61
- **F1-score**: 0.74
- **Support**: 99

For class 1 (postings labelled `fraudulent`), the precision is still high at 94%, but the recall is lower at 61%. In other words, when the model predicts class 1 it is usually right, but it misses roughly 39% of the actual class 1 instances. The F1-score for class 1 is 0.74, noticeably weaker than for class 0.

**Overall Performance**
- **Accuracy**: 0.98
- **Macro Average Precision**: 0.96
- **Macro Average Recall**: 0.80
- **Macro Average F1-score**: 0.86
- **Weighted Average Precision**: 0.97
- **Weighted Average Recall**: 0.98
- **Weighted Average F1-score**: 0.97

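The overall accuracy of 98% looks strong, but it should be read against the class imbalance in the validation set (1672 instances of class 0 vs. 99 instances of class 1): a model that always predicted class 0 would already reach about 94.4% accuracy (1672/1771). The macro average metrics are the unweighted mean of the per-class metrics, so the weaker class 1 results pull them down (macro recall 0.80). The weighted average metrics weight each class by its support, so they are dominated by class 0 and stay close to the class 0 values.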
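
The reported figures can be cross-checked by recomputing them from confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an inference: they are the combination that fits the reported accuracy (1728 of 1771 correct) together with the class 1 recall and support. Treat them as a reconstruction, not as notebook output.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts (assumption, see above); class 1 is treated as "positive"
tp, fp, fn, tn = 60, 4, 39, 1668
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
class_1 = precision_recall_f1(tp, fp, fn)
class_0 = precision_recall_f1(tn, fn, fp)  # roles swap when class 0 is "positive"
majority_baseline = (tn + fp) / total      # accuracy of always predicting class 0

print(f"accuracy: {accuracy:.4f}")
print("class 1 (precision, recall, f1):", [round(v, 4) for v in class_1])
print("class 0 (precision, recall, f1):", [round(v, 4) for v in class_0])
print(f"always-predict-0 baseline: {majority_baseline:.4f}")
```

Under these counts, the gain over the always-predict-0 baseline comes from the 60 class 1 instances the model catches, at the cost of 4 false alarms, while 39 class 1 instances are still missed.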

In summary, the logistic regression model reaches an overall accuracy of 97.57%. It performs exceptionally well for the majority class (class 0) but has markedly lower recall for the minority class (class 1, recall 0.61). The classification report therefore points to minority-class recall as the main area for improvement.


# Output Analysis

**Logistic Regression**

- **Accuracy**: 0.9757
- **Precision (Class 0)**: 0.98
- **Recall (Class 0)**: 1.00
- **F1-Score (Class 0)**: 0.99
- **Precision (Class 1)**: 0.94
- **Recall (Class 1)**: 0.61
- **F1-Score (Class 1)**: 0.74
- **Macro Avg Precision**: 0.96
- **Macro Avg Recall**: 0.80
- **Macro Avg F1-Score**: 0.86
- **Weighted Avg Precision**: 0.97
- **Weighted Avg Recall**: 0.98
- **Weighted Avg F1-Score**: 0.97

The logistic regression model achieved an excellent overall accuracy of 97.57%. It demonstrated strong performance in the majority class (Class 0) with high precision, recall, and F1-score. However, the model had a lower recall for the minority class (Class 1), indicating it may miss a significant portion of the positive cases.


## Comparison

### Logistic Regression:

- **Plus**: Very high overall accuracy (97.57%), excellent performance in the majority class (Class 0)
- **Minus**: Lower recall for the minority class (Class 1, recall 0.61)
- **Suitable scenario**: Situations where overall accuracy matters most and missing some minority-class cases is acceptable


### Recommendation:

If high overall accuracy is the priority, **Logistic Regression** remains a strong contender even with a noticeable class imbalance, although its tendency to underperform on the minority class (recall of 0.61 here) has to be weighed against that.

In scenarios involving sequential or time-series data, where capturing intricate dependencies is crucial, opting for an **LSTM Model** would be advantageous.

When dealing with textual data or situations requiring nuanced understanding of context, leveraging a **BERT-Based Model** proves highly effective.

When analyzing metrics, the Logistic Regression model consistently demonstrates superior performance across accuracy and weighted averages. However, depending on the complexity and specificity of the task at hand, either the LSTM or BERT model might offer more suitable solutions.