# Predictive model

A predictive model is a mathematical function that, applied to a set of data, identifies hidden patterns and uses those patterns to make predictions. The purpose of machine learning is to "learn" an approximation of the function that best represents the relationship between the input attributes (the predictor variables) and the output variable (the one we want to predict).

Machine learning algorithms are usually divided into three types: supervised learning, unsupervised learning and reinforcement learning.


## The Process

In simplified form, the process of creating a predictive model consists of the following sequence of activities:

1. Data collection;
2. Data exploration and preparation;
3. Model training;
4. Model evaluation;
5. Model optimization.

These activities are performed iteratively (modifying parameters, reorganizing data, obtaining new data, testing algorithms, creating new variables, among others) until we have a model that solves the business problem we are working on.
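
The five activities above can be sketched end to end with scikit-learn. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), and the choice of `LogisticRegression`, the scaling step and the parameter grid are placeholders, not the model used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Data collection: synthetic data stands in for a real source (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 2. Exploration and preparation: hold out a test set. Scaling is done inside
#    the pipeline so that it is fitted on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Training: the pipeline bundles the preparation step and the model.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# 4. Evaluation: accuracy on data the model has not seen.
baseline_accuracy = pipeline.score(X_test, y_test)

# 5. Optimization: try several regularization strengths with cross-validation.
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
tuned_accuracy = search.score(X_test, y_test)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"tuned accuracy:    {tuned_accuracy:.3f}")
print(f"best parameters:   {search.best_params_}")
```

In practice the loop rarely runs once: a disappointing evaluation usually sends us back to step 1 or 2 rather than straight to step 5.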


## Data Preparation

Normally, creating the predictive model itself is very fast compared to the time the data scientist needs to spend preparing the data set, because the data can come from different sources, in different formats, with errors, or in need of manipulation. It is up to us to carry out the activities that ensure the predictive model receives its input correctly. The data preparation step, which we did earlier, is extremely important for the accuracy of our model: bad input data will generate bad outputs, so we should never neglect this step.


## Choice of Predictor Variables

We will classify our variables into two types:

- **Predictor variables**: the variables that will be used as input to the predictive model;
- **Target variables**: the variables we want to predict.

In this notebook we want to predict whether a job posting is fraudulent, based on the fake job postings data set loaded below. So our target variable is the fraudulent column.

Our predictor variable for the first model will be the description column, converted into TF-IDF features.
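
As a minimal sketch of this separation, the snippet below splits a DataFrame into predictors and target. The toy data and the column names (`feature_a`, `feature_b`, `label`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 2],
    "feature_b": [30.0, 55.0, 80.0, 60.0],
    "label": ["no", "no", "yes", "no"],
})

target_column = "label"
predictor_columns = [c for c in toy.columns if c != target_column]

X = toy[predictor_columns]  # predictor variables (model input)
y = toy[target_column]      # target variable (what we want to predict)

print(X.shape, y.shape)
```

Keeping the column lists in variables makes it easy to add or remove predictors while iterating on the model.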


## Splitting Training and Testing

We should divide our data set into two parts, one for training the predictive model and the other for testing it. There is no general rule for dividing the data; we will use an 80/20 split (80% of the data for training and 20% for testing). What we need to make sure of is that each sample reliably represents our entire population of data.
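
One way to keep both parts representative is a stratified split, sketched below on hypothetical imbalanced toy data. The `stratify` argument of `train_test_split` is an optional extra here, not something this notebook is known to use. It asks scikit-learn to preserve the class proportions of the target in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 90 "no" labels and 10 "yes" labels.
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["no"] * 90 + ["yes"] * 10,
})

X = toy[["feature"]]
y = toy["label"]

# test_size=0.2 gives the 80/20 split; stratify=y keeps the class
# proportions of y the same in the training and testing parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Compare the class proportions of each part with the full data set.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
print(y.value_counts(normalize=True))
```

Without stratification a rare class can end up over- or under-represented in the test part purely by chance, which makes the evaluation less reliable.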

## Import dependencies

In [2]:
# pandas - pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
import pandas as pd

# NumPy - The fundamental package for scientific computing.
import numpy as np

# matplotlib - A comprehensive library for creating static, animated, and interactive visualizations
import matplotlib.pyplot as plt

# seaborn - statistical data visualization
import seaborn as sns

## Exploration and Preparation of the Dataset

Pandas allows us to load our dataset into a DataFrame, a two-dimensional structure of rows and columns. Below we can view the data in the form of a table.

### Columns
- **job_id** - Unique Job ID
- **title** - The title of the job ad entry.
- **location** - Geographical location of the job ad.
- **department** - Corporate department (e.g. sales).
- **salary_range** - Indicative salary range (e.g. $50,000-$60,000)
- **company_profile** - A brief company description.
- **description** - The detailed description of the job ad.
- **requirements** - Listed requirements for the job opening.
- **benefits** - Benefits offered by the employer.
- **telecommuting** - True for telecommuting positions.
- **has_company_logo** - True if company logo is present.
- **has_questions** - True if screening questions are present.
- **employment_type** - Full-time, Part-time, Contract, etc.
- **required_experience** - Executive, Entry level, Intern, etc.
- **required_education** - Doctorate, Master’s Degree, Bachelor, etc.
- **industry** - Automotive, IT, Health care, Real estate, etc.
- **function** - Consulting, Engineering, Research, Sales etc.
- **fraudulent** - Target classification attribute.

In [3]:
index_column = "job_id"

dtypes = {
    "title": "string",
    "location": "string",
    "department": "string",
    "salary_range": "string",
    "company_profile": "string",
    "description": "string",
    "requirements": "string",
    "benefits": "string",
    "telecommuting": "category",
    "has_company_logo": "category",
    "has_questions": "category",
    "employment_type": "string",
    "required_experience": "string",
    "required_education": "string",
    "industry": "string",
    "function": "string",
    "fraudulent": "category"
}

path = "../src/dataset/fake_job_postings.csv"

df = pd.read_csv(path, index_col=index_column, dtype=dtypes)


In [4]:
df.shape

(17880, 17)

In [5]:
categories = {
    "telecommuting": ["no", "yes"],
    "has_company_logo": ["no", "yes"],
    "has_questions": ["no", "yes"],
    "fraudulent": ["no", "yes"]
}

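# Rename the categories in order: the first existing category becomes "no",
# the second "yes". Assigning to .cat.categories directly no longer works
# in pandas 2.x, so rename_categories is used instead.
for column, column_categories in categories.items():
    df[column] = df[column].cat.rename_categories(column_categories)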


In [6]:
df

job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,no,yes,no,Other,Internship,,,Marketing,no
2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,no,yes,no,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,no
3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,no,yes,no,,,,,,no
4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,no,yes,no,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,no
5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,no,yes,yes,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17876,Account Director - Distribution,"CA, ON, Toronto",Sales,,Vend is looking for some awesome new talent to...,Just in case this is the first time you’ve vis...,To ace this role you:Will eat comprehensive St...,What can you expect from us?We have an open cu...,no,yes,yes,Full-time,Mid-Senior level,,Computer Software,Sales,no
17877,Payroll Accountant,"US, PA, Philadelphia",Accounting,,WebLinc is the e-commerce platform and service...,The Payroll Accountant will focus primarily on...,- B.A. or B.S. in Accounting- Desire to have f...,Health &amp; WellnessMedical planPrescription ...,no,yes,yes,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,no
17878,Project Cost Control Staff Engineer - Cost Con...,"US, TX, Houston",,,We Provide Full Time Permanent Positions for m...,Experienced Project Cost Control Staff Enginee...,At least 12 years professional experience.Abil...,,no,no,no,Full-time,,,,,no
17879,Graphic Designer,"NG, LA, Lagos",,,,Nemsia Studios is looking for an experienced v...,1. Must be fluent in the latest versions of Co...,Competitive salary (compensation will be based...,no,no,yes,Contract,Not Applicable,Professional,Graphic Design,Design,no


## Fill empty values

A very common problem when working with large data sets is null or unfilled values, which can result from data-entry errors, errors in importing and transforming the data, information that does not exist, or any other reason.

How we treat null values depends on the objective of our analysis. We can leave the values null if they do not negatively affect the analysis, delete every row that has a null value, or fill in the null values with a specific value. There are countless ways to treat null values, but we must always keep our objective in mind when deciding which approach to use.

We will analyze the null values of each column and decide which variables we will keep null and which ones we will fill with values.
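
The three options mentioned above can be compared on a small example. The toy DataFrame below is hypothetical, and filling the numeric column with its median is just one possible choice, shown for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one null value in each column.
toy = pd.DataFrame({
    "title": ["Engineer", "Analyst", None],
    "salary": [50000.0, np.nan, 42000.0],
})

# Option 1: leave the nulls in place and only count them.
null_counts = toy.isnull().sum()

# Option 2: drop every row that contains at least one null value.
dropped = toy.dropna()

# Option 3: fill nulls with a specific value per column.
filled = toy.fillna({"title": "Missing", "salary": toy["salary"].median()})

print(null_counts)
print(dropped)
print(filled)
```

Dropping rows is the simplest option but can discard a large share of the data when nulls are spread across many columns, which is why filling is often preferred for sparse text fields.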


In [7]:
df.isnull().sum()

title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2695
benefits                7210
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64

In [5]:
fillna = {
    "title": "Missing",
    "location": "Missing",
    "department": "Missing",
    "salary_range": "Missing",
    "company_profile": "Missing",
    "description": "Missing",
    "requirements": "Missing",
    "benefits": "Missing",
    "employment_type": "Missing",
    "required_experience": "Missing",
    "required_education": "Missing",
    "industry": "Missing",
    "function": "Missing",
}

for column, column_fillna in fillna.items():
    df[column] = df[column].fillna(column_fillna)


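In [6]:
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# GaussianNB needs a dense array. With the full vocabulary (62,254 terms) and
# test_size=0.5, this cell failed with:
#   MemoryError: Unable to allocate 4.15 GiB for an array with shape (8940, 62254) and data type float64
# Capping the vocabulary keeps the dense array small. The value 2000 is an
# arbitrary cap chosen to fit in memory, not a tuned setting.
vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(df["description"]).toarray()
y = df["fraudulent"]

# 80/20 split, as described in "Splitting Training and Testing"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)

print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))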

In [None]:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
print(y.value_counts(normalize=True))
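
A dense float64 array of shape (8940, 62254) needs 8940 × 62254 × 8 bytes, about 4.15 GiB, which is why converting a TF-IDF matrix with `.toarray()` is costly. One possible alternative, sketched below, is to keep the matrix sparse and use `MultinomialNB`, which accepts sparse input. Note that this swaps in a different Naive Bayes variant from the `GaussianNB` used above, and that the toy corpus and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; the texts and labels are made up for illustration.
texts = [
    "earn money fast from home no experience needed",
    "send your bank details to receive your first payment",
    "work from home and earn cash quickly wire transfer",
    "easy money guaranteed just pay a small starting fee",
    "software engineer to build and maintain backend services",
    "marketing intern to support our content team",
    "accountant responsible for payroll and monthly reporting",
    "graphic designer experienced with print and web layouts",
] * 5
labels = (["yes"] * 4 + ["no"] * 4) * 5

# Split the raw texts first so the vectorizer only learns from training data.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)  # SciPy sparse matrix
X_test = vectorizer.transform(texts_test)        # same vocabulary, still sparse

# MultinomialNB works directly on the sparse matrix, so no .toarray() is needed.
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order no, yes.
matrix = confusion_matrix(y_test, y_pred, labels=["no", "yes"])
print(matrix)
```

A confusion matrix is more informative than a single count of mislabeled points when the classes are imbalanced, because it shows separately how many fraudulent postings were missed and how many legitimate ones were flagged.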