##### Context and Content

A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps to reduce cost and time, improve the quality of training, and plan the courses and the categorization of candidates. Information about demographics, education, and experience is available from the candidates' signup and enrollment.

This dataset is also designed for HR research into the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.

The data is divided into train and test sets. The target isn't included in the test set, but a file with the test target values is available for related tasks. A sample submission corresponding to the enrollee_id values of the test set is provided too, with columns: enrollee_id, target


##### Features

enrollee_id : Unique ID for candidate

city: City code

city_development_index: Development index of the city (scaled)

gender: Gender of candidate

relevent_experience: Relevant experience of candidate

enrolled_university: Type of University course enrolled if any

education_level: Education level of candidate

major_discipline: Education major discipline of candidate

experience: Candidate total experience in years

company_size: Number of employees in current employer's company

company_type: Type of current employer

last_new_job: Difference in years between previous job and current job

training_hours: Training hours completed

target: 0 – Not looking for job change, 1 – Looking for a job change

##### Inspiration
- Predict the probability that a candidate will work for the company
- Interpret the model(s) in a way that illustrates which features affect a candidate's decision

##### Goal of the project

Build a model that predicts whether someone is willing to change jobs.

##### Minor goals
- Investigate the data
- Understand the data - search for insights, clues
- Preprocess the data, so it can be used later on by a machine learning model

In [1]:
# import the pandas library

import pandas as pd

In [2]:
# read the file with training data

file = 'data/aug_train.csv'
df = pd.read_csv(file)

In [None]:
# let's see the first 5 rows of the dataset

df.head()

##### Examining the data

We need to get a grasp of how big the dataset is. We also need to know how many examples we have for each of the two classes we want to predict:
- people who want to change their job
- people who want to stay at their current workplace

In [None]:
df.describe()

# Question: Did this output tell us anything? Conclusions?

In [None]:
df.shape

# Question: Is this enough? Can we expect to have our algorithm work properly? What else do we want to check?

In [None]:
df['target'] == 0

In [None]:
df['target'] == 1

In [None]:
df[df['target'] == 0].shape

In [None]:
df[df['target'] == 1].shape

# Question: What do we notice here? What does it tell us? Why is it important from the algorithm's perspective?
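
A more compact way to compare the two classes is `value_counts()`. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in, not the real training data), so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Toy stand-in for the training frame; the values are made up for illustration.
toy = pd.DataFrame({'target': [0.0, 0.0, 0.0, 1.0]})

# Absolute number of rows per class.
counts = toy['target'].value_counts()

# Share of each class; normalize=True divides by the total row count.
shares = toy['target'].value_counts(normalize=True)

print(counts)
print(shares)
```

Running the same two calls on `df['target']` should show at a glance whether one class dominates the other.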

### Investigation of columns
Let's investigate separate columns one by one.

##### Two ways of manipulating pandas dataframes:
- assign output of a function to a variable
- set *inplace* flag to True

In [10]:
# 1) sorted_df = df.set_index('enrollee_id').sort_index()
# 2) df.set_index('enrollee_id', inplace=True)
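
The difference between the two styles can be seen on a toy frame. This is a minimal sketch with made-up values; `toy`, `indexed` and `toy_copy` are names introduced only for this illustration.

```python
import pandas as pd

# Made-up data, only to show the mechanics.
toy = pd.DataFrame({'enrollee_id': [3, 1, 2], 'training_hours': [10, 20, 30]})

# 1) Assignment: the method returns a new frame, the original stays untouched.
indexed = toy.set_index('enrollee_id').sort_index()

# 2) inplace=True: the frame itself is modified and the call returns None.
toy_copy = toy.copy()
result = toy_copy.set_index('enrollee_id', inplace=True)

print(list(indexed.index))
print(result)
print(toy_copy.index.name)
```

Because the in-place call returns `None`, it cannot be chained with further methods, which is one reason the assignment style is used in the cells below.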

In [11]:
df = df.set_index('enrollee_id')

In [None]:
df.head()

In [13]:
df = df.sort_index()

In [None]:
df.head()

# Question: Look at the rest of the columns: what dtypes do they represent?

### Let's investigate the unique values for columns containing categorical variables

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.info(memory_usage='deep')

In [18]:
continous_columns = ['city_development_index', 'training_hours']

In [19]:
categorical_columns = ['city', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'major_discipline', 'experience', 'company_size', 'company_type', 'last_new_job']

In [20]:
df[categorical_columns] = df[categorical_columns].astype('category')

In [None]:
df.info(memory_usage='deep')

In [None]:
df.dtypes

In [None]:
df.head()

### Verify what kind of unique values exist in each column:

In [None]:
for col in categorical_columns:
    unique_items = df[col].unique().tolist()
    print(f'col name: {col}, unique_items: {unique_items}')

In [None]:
for col in categorical_columns:
    unique_items = df[col].unique().tolist()
    print(f'col name: {col}, len of unique items: {len(unique_items)}')

### See how many nans exist in each column:

In [26]:
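# element-wise check; note that DataFrame.applymap is deprecated since pandas 2.1 in favour of DataFrame.map
from pandas import isnull
df_is_na_value = df.applymap(lambda x: isnull(x))

In [27]:
# column-wise equivalent of the cell above; df.isna() gives the same result directly
df_is_na_value = df.apply(lambda x: x.isna())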

In [None]:
df_is_na_value.sum(axis=0)

In [None]:
df_is_na_value.sum(axis=1)

# Question: What do we do with those nan values?
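
Two common starting points are dropping incomplete rows and filling the gaps with an explicit placeholder. The sketch below shows both on a made-up frame; the column values and the `'unknown'` label are assumptions for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in one categorical column.
toy = pd.DataFrame({
    'gender': ['Male', np.nan, 'Female', np.nan],
    'training_hours': [10, 20, 30, 40],
})

# Share of missing values per column (0.0 means no gaps).
missing_share = toy.isna().mean()

# Option A: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option B: keep all rows and mark the gap with a placeholder category.
filled = toy.fillna({'gender': 'unknown'})

print(missing_share)
print(len(dropped), len(filled))
```

Option A loses data, Option B keeps it but adds a new category. If a column has already been converted to the `category` dtype, the placeholder has to be registered first (for example with `cat.add_categories`), otherwise `fillna` raises an error.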

### A quick remark on speed issues in pandas
Let's assume we want to perform a very simple operation - we need to raise the values in the 'city_development_index' column to the second power. Let's try three approaches:
- loops
- apply() method
- builtin vectorized pandas method

##### Loop

In [30]:
%%timeit

for _, record in df.iterrows():
    record['city_development_index'] ** 2

848 ms ± 62.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


##### apply()

In [31]:
%%timeit

df['city_development_index'].apply(lambda x: x**2)

5.43 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


##### vectorized builtin method

In [32]:
%%timeit

df['city_development_index'] ** 2

99.1 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### One Hot Encoding - what is it?
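
In short, every distinct value of a categorical column becomes its own 0/1 column. A minimal sketch on a made-up single-column frame (`toy` and `toy_encoder` are hypothetical names used only here):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column with two distinct values.
toy = pd.DataFrame({'education_level': ['Masters', 'Graduate', 'Masters']})

toy_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense for printing.
toy_encoded = toy_encoder.fit_transform(toy).toarray()

# Categories are stored in sorted order, one output column per category.
print(toy_encoder.categories_[0])
print(toy_encoded)
```

Each row contains a single 1 in the column that matches its original value, which is why a column with many distinct values produces many output columns.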

In [33]:
from sklearn.preprocessing import OneHotEncoder

In [34]:
encoder = OneHotEncoder()

In [35]:
encoder.fit(df[categorical_columns])

OneHotEncoder()

In [36]:
encoded = encoder.transform(df[categorical_columns])

In [37]:
encoded.shape

(19158, 192)

In [38]:
type(encoded)

scipy.sparse._csr.csr_matrix

In [None]:
encoded.toarray()

In [None]:
encoder.categories_

In [41]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [None]:
encoded.shape

In [78]:
from sklearn.model_selection import train_test_split

In [None]:
encoded.toarray()

In [45]:
continous_values = df[continous_columns].values

In [None]:
continous_values

In [None]:
continous_values.shape

In [48]:
import numpy as np

In [None]:
print(encoded.toarray().shape)
print(continous_values.shape)

In [50]:
x_values = np.concatenate(
    (encoded.toarray(), continous_values), axis=1
)

In [None]:
x_values.shape

In [79]:
y_values = df['target'].values

In [None]:
y_values

In [None]:
y_values.shape

In [55]:
x_train, x_val, y_train, y_val = train_test_split(x_values, y_values)

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_val.shape)
print(y_val.shape)


In [57]:
model = LogisticRegression()

In [None]:
model.fit(x_train, y_train)

In [59]:
y_pred = model.predict(x_val)

In [None]:
y_pred

In [61]:
from sklearn.metrics import accuracy_score, f1_score, recall_score

In [62]:
y_pred_train = model.predict(x_train)

In [63]:
train_acc = accuracy_score(y_train, y_pred_train)
val_acc = accuracy_score(y_val, y_pred)

In [None]:
print(f'training accuracy: {train_acc}')

In [None]:
print(f'validation accuracy: {val_acc}')

In [66]:
train_f1 = f1_score(y_train, y_pred_train)
val_f1 = f1_score(y_val, y_pred)

In [None]:
train_f1

In [None]:
val_f1

In [71]:
from sklearn.metrics import classification_report

In [None]:
classification_report(y_val, y_pred, output_dict=True, target_names=['not leaving', 'leaving'])

#### What can you tell about those results? Are we satisfied with this approach? Can you please try other models?

* RandomForestClassifier
* XGBClassifier
* DecisionTreeClassifier
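
Swapping in another scikit-learn model only changes the constructor line. The sketch below uses `RandomForestClassifier` on synthetic data from `make_classification`, so that it runs on its own; the data, the hyperparameters and the names `x_toy`, `y_toy` and `forest` are assumptions, and the score it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for x_values / y_values (roughly 75% class 0).
x_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=0
)

# stratify keeps the class ratio equal in both splits.
x_tr, x_va, y_tr, y_va = train_test_split(
    x_toy, y_toy, stratify=y_toy, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_tr, y_tr)

forest_f1 = f1_score(y_va, forest.predict(x_va))
print(f'validation f1: {forest_f1:.3f}')
```

In the notebook itself, `x_train`, `y_train`, `x_val` and `y_val` from the cells above would take the place of the synthetic arrays.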

### How can we manipulate the data? What can be done? Please suggest what can be done about:

* high cardinality of the data
* NaN values
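
One possible way to reduce cardinality is to merge rare categories into a single bucket. This is a sketch on made-up values; the threshold `min_count` and the label `'other'` are hypothetical choices that would need tuning on the real data.

```python
import pandas as pd

# Made-up city codes: two frequent values and two rare ones.
toy_city = pd.Series(
    ['city_1', 'city_1', 'city_1', 'city_2', 'city_2', 'city_3', 'city_4']
)

# Keep only the categories seen at least min_count times.
min_count = 2  # hypothetical threshold
counts = toy_city.value_counts()
frequent = counts[counts >= min_count].index

# Everything that is not frequent goes into a single 'other' bucket.
grouped = toy_city.where(toy_city.isin(frequent), 'other')

print(grouped.value_counts())
```

After grouping, one-hot encoding produces one column per remaining category instead of one per original value.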

### Explain the results - Eli5

In [73]:
import sys

In [74]:
sys.executable

'/Users/patrykseweryn/Programowanie/ai_academy/venv/bin/python'

In [1]:
import eli5

In [None]:
eli5.show_weights(model)

In [None]:
eli5.show_weights(model, top=10)

### Use eli5 to explain predictions for single elements
### What is the difference between the functions 'show_weights()' and 'explain_prediction()'?
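
As a hint, the idea of "global weights" versus "explanation of one prediction" can be reproduced by hand for a linear model. The sketch below deliberately does not call eli5; it uses a tiny made-up dataset and plain scikit-learn attributes, and it is only an approximation of what the eli5 documentation describes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: two features, binary target.
x_toy = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0],
                  [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_toy = np.array([1, 0, 0, 1, 1, 0])

toy_model = LogisticRegression().fit(x_toy, y_toy)

# Global view: one weight per feature, valid for the whole model.
global_weights = toy_model.coef_[0]

# Local view: contribution of each feature to the score of ONE sample.
sample = x_toy[3]
contributions = global_weights * sample
score = contributions.sum() + toy_model.intercept_[0]

print(global_weights)
print(contributions, score)
```

The weights are the same for every row, while the contributions change with the feature values of the row being explained.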

### Tasks
  - Load the aug_test.csv file and evaluate the model on the test set. How is it different from the validation dataset?
  - Try different models - DecisionTreeClassifier, RandomForestClassifier, SVM, Bayesian methods - which one gives the best results? What else did you notice?
  - Try doing better feature selection
  - Think about how you could handle NaN values - remove records? use an imputer? or something else?
  - Use eli5 to explain predictions for single elements
  - What is the difference between the functions 'show_weights()' and 'explain_prediction()'? Check the answer in the eli5 documentation.
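
A note for the first task: the test set should be transformed with the encoder that was fitted on the training data, not with a freshly fitted one, so that the columns line up with what the model saw. A minimal sketch on made-up frames (`train_toy`, `test_toy` and `enc` are hypothetical names, and `handle_unknown='ignore'` is one possible policy for categories that appear only in the test set):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up frames; 'Other' appears only in the test data.
train_toy = pd.DataFrame({'company_type': ['Pvt Ltd', 'NGO', 'Pvt Ltd']})
test_toy = pd.DataFrame({'company_type': ['NGO', 'Other']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_toy)

# Only transform here: the column layout stays the one learned from train.
test_encoded = enc.transform(test_toy).toarray()

print(enc.categories_[0])
print(test_encoded)
```

The same applies to any other fitted preprocessing step: fit on the training data, then only transform the validation and test data.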
