# Predictive model

A predictive model is a mathematical function that, applied to a data set, identifies hidden patterns and uses those patterns to make predictions. The purpose of machine learning is to "learn" an approximation of the function that best represents the relationship between the input attributes (the predictor variables) and the output variable that we want to predict.

Machine learning algorithms are usually divided into three types: supervised learning, unsupervised learning, and reinforcement learning.


## The Process

In simplified form, the process of creating a predictive model consists of the following sequence of activities:

1. Data collection
2. Data exploration and preparation
3. Model training
4. Model evaluation
5. Model optimization

These activities are performed iteratively (modifying parameters, reorganizing data, obtaining new data, testing algorithms, creating new variables, and so on) until we have a model that solves the business problem we are working on.
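
The five activities can be sketched end to end on synthetic data. This is a minimal illustration, not the workflow used later in this notebook: the data from `make_classification`, the `LogisticRegression` estimator, and the grid of `C` values are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Data collection: synthetic data stands in for a real source.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 2. Exploration and preparation: reduced here to a train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Training.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. Evaluation on data the model has not seen.
baseline = accuracy_score(y_test, model.predict(X_test))

# 5. Optimization: cross-validated search over the regularization strength C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5
)
search.fit(X_train, y_train)
tuned = accuracy_score(y_test, search.predict(X_test))

print(f"baseline={baseline:.3f} tuned={tuned:.3f} best C={search.best_params_['C']}")
```

In practice the loop rarely ends after one pass: a disappointing score in step 4 usually sends us back to step 1 or 2.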


## Data Preparation

Normally, building the predictive model itself is very fast compared with the time the data scientist must spend preparing the data set, because the data can come from different sources and in different formats, contain errors, or require manipulation. It is up to us to carry out the activities that ensure the predictive model receives its input correctly. The data preparation step, which we did earlier, is extremely important for the accuracy of our model. Since bad input data generates bad outputs, we should never neglect this step.
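
As a small illustration of the kind of cleanup this step involves, the sketch below repairs a toy table with unparseable numbers, inconsistent spelling, and missing values. The rows and the column names `price` and `room_type` are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical raw data with typical problems: text instead of numbers,
# inconsistent spelling, and missing entries.
raw = pd.DataFrame(
    {
        "price": ["100", "250", None, "abc"],
        "room_type": ["Private room", "private room ", None, "Entire home"],
    }
)

clean = raw.copy()
# Coerce text to numbers; unparseable entries become NaN instead of raising.
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")
# Normalize spelling variants, then label missing categories explicitly.
clean["room_type"] = clean["room_type"].str.strip().str.lower().fillna("missing")
# Drop rows where the value we need is unusable.
clean = clean.dropna(subset=["price"])

print(clean)
```

Whether to drop, fill, or flag a bad value depends on the problem; the choices above are only one reasonable option.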


## Choice of Predictor Variables

We will classify our variables into two types:

- **Predictor variables**: the variables that will be used as input to the predictive model.
- **Target variables**: the variables we want to predict.

As previously mentioned, we want to predict the price of accommodation at an Airbnb listing in New York, NY, based on the data set obtained from the Kaggle website. Our target variable is therefore the price column.

Our predictor variables will be: neighborhood_group, neighborhood, latitude, longitude, room_type, number_of_reviews, reviews_per_month, calculated_host_listings_count, and availability_365.
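
Separating predictors from the target could look like the sketch below. The rows are made up, and only three of the predictor columns named above are included to keep the example short.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the text above.
listings = pd.DataFrame(
    {
        "room_type": ["Private room", "Entire home/apt", "Shared room"],
        "number_of_reviews": [9, 45, 0],
        "availability_365": [365, 355, 120],
        "price": [149, 225, 60],
    }
)

target = "price"
predictors = ["room_type", "number_of_reviews", "availability_365"]

X = listings[predictors]  # inputs to the model
y = listings[target]      # value to predict

print(X.shape, y.shape)
```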


## Splitting into Training and Testing Sets

We should divide our data set into two parts, one for training the predictive model and the other for testing it. There is no general rule for dividing the data; we will use an 80/20 division (80% of the data for training and 20% for testing). What matters is that each separated sample reliably represents our entire population of data.
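
One common way to keep both samples representative for a classification target is a stratified split, which preserves the class proportions in each part. The sketch below uses toy data in which 10% of the samples are positive; the `stratify` argument is an addition for illustration and is not necessarily what this notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 10 of them in the positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)

# 80/20 split; stratify=y keeps the 10% positive rate in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Without stratification, a rare class can end up over- or under-represented in the test set purely by chance.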

## Import dependencies

In [2]:
# pandas: a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool.
import pandas as pd

# NumPy: the fundamental package for scientific computing.
import numpy as np

# scikit-learn: text vectorization, train/test splitting, the classifier, and the evaluation metric.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import balanced_accuracy_score

## Preparation of the Dataset

In [3]:
path = "../src/dataset/fake_job_postings.csv"
index_column = "job_id"

df = pd.read_csv(path, index_col=index_column)
df = df.fillna("Missing")

# Concatenate all free-text fields into a single column for vectorization.
df['Empresa'] = df['title'] + ' ' + df['company_profile'] + ' ' + df['description'] + ' ' + df['requirements'] + ' ' + df['benefits']

# Drop the raw text fields and the columns that are not used as features.
df = df.drop(columns=['title', 'location', 'company_profile', 'description', 'requirements', 'benefits', 'salary_range'])

# Encode each categorical column as integer codes.
# Assigning directly to .cat.categories no longer works in pandas 2.x,
# so .cat.codes is used and the original labels are kept for decoding later.
categorical_columns = ['required_experience', 'employment_type', 'required_education', 'function', 'industry', 'department']
category_labels = {}
for column in categorical_columns:
    as_category = df[column].astype('category')
    category_labels[column] = as_category.cat.categories
    df[column] = as_category.cat.codes
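
Integer encoding of a categorical column can be illustrated on a small example. The values below are hypothetical; the point is that pandas sorts the category labels and each row receives the position of its label, so the labels can be recovered from the codes.

```python
import pandas as pd

# Hypothetical values for one categorical column.
s = pd.Series(["Full-time", "Missing", "Contract", "Full-time"])
cat = s.astype("category")

codes = cat.cat.codes                  # integer code for each row
labels = list(cat.cat.categories)      # position -> label, sorted alphabetically
decoded = [labels[c] for c in codes]   # map the codes back to labels

print(codes.tolist(), labels)
```

Note that these codes impose an arbitrary numeric order on the labels, which some models may interpret as meaningful; one-hot encoding is a common alternative when that is a concern.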

In [4]:
df

| job_id | department | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent | Empresa |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 758 | 0 | 1 | 0 | 3 | 4 | 6 | 83 | 22 | 0 | Marketing Intern We're Food52, and we've creat... |
| 2 | 1162 | 0 | 1 | 0 | 1 | 7 | 6 | 75 | 7 | 0 | Customer Service - Cloud Video Production 90 S... |
| 3 | 793 | 0 | 1 | 0 | 2 | 6 | 6 | 83 | 23 | 0 | Commissioning Machinery Assistant (CMA) Valor ... |
| 4 | 1055 | 0 | 1 | 0 | 1 | 5 | 1 | 22 | 32 | 0 | Account Executive - Washington DC Our passion ... |
| 5 | 793 | 0 | 1 | 1 | 1 | 5 | 1 | 51 | 16 | 0 | Bill Review Manager SpotSource Solutions LLC i... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17876 | 1055 | 0 | 1 | 1 | 1 | 5 | 6 | 22 | 32 | 0 | Account Director - Distribution Vend is looki... |
| 17877 | 62 | 0 | 1 | 1 | 1 | 5 | 1 | 61 | 0 | 0 | Payroll Accountant WebLinc is the e-commerce p... |
| 17878 | 793 | 0 | 0 | 0 | 1 | 6 | 6 | 83 | 23 | 0 | Project Cost Control Staff Engineer - Cost Con... |
| 17879 | 793 | 0 | 0 | 1 | 0 | 7 | 7 | 48 | 9 | 0 | Graphic Designer Missing Nemsia Studios is loo... |


In [8]:
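# Converting the full TF-IDF matrix with .toarray() failed with:
# MemoryError: Unable to allocate 13.9 GiB for an array with shape (17880, 104205) and data type float64
# GaussianNB needs a dense array, so the vocabulary is capped to keep that array small.
# max_features=5000 is an assumed value, not a tuned one.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['Empresa']).toarray()
y = df["fraudulent"]

# Hold out 20% of the rows for testing, matching the 80/20 split described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

method = GaussianNB()
model = method.fit(X_train, y_train)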
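
The 13.9 GiB figure follows directly from the matrix shape: every cell of a dense float64 array takes 8 bytes, whether or not it is zero. The sketch below reproduces the arithmetic and then shows, on a tiny made-up corpus, that the TF-IDF output can stay sparse when the classifier accepts sparse input. `ComplementNB` is swapped in here purely as an illustration of a sparse-friendly estimator; it is not the model used in this notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

# Memory needed to store the dense matrix reported in the error.
rows, cols = 17880, 104205
bytes_per_value = 8  # float64
dense_gib = rows * cols * bytes_per_value / 2**30
print(f"{dense_gib:.1f} GiB")

# Hypothetical four-document corpus with made-up labels.
docs = [
    "earn money fast from home",
    "software engineer position",
    "work from home earn cash",
    "senior data engineer",
]
labels = [1, 0, 1, 0]

# fit_transform returns a SciPy sparse matrix; only non-zero entries are stored.
X_sparse = TfidfVectorizer().fit_transform(docs)
clf = ComplementNB().fit(X_sparse, labels)  # trains without densifying
predictions = clf.predict(X_sparse)
```

Capping the vocabulary size and keeping the matrix sparse are two independent ways to reduce memory use, and they can be combined.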

In [7]:
y_pred = model.predict(X_test)

score = balanced_accuracy_score(y_test, y_pred)

print(f'Balanced accuracy {score*100:.2f}%')

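
Balanced accuracy is the mean of the per-class recall, which makes it more informative than plain accuracy when one class is rare. The toy labels below are made up: a classifier that always predicts the majority class looks good on accuracy but no better than chance on balanced accuracy.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)           # fraction of correct predictions
bal = balanced_accuracy_score(y_true, y_pred)  # mean recall over both classes

print(f"accuracy={acc:.2f} balanced accuracy={bal:.2f}")
```

Here the recall is 1.0 for the negative class and 0.0 for the positive class, so the balanced accuracy is 0.5 even though 95% of the predictions are correct.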