# Recommendation System (Random Forest)

* The goal of this notebook is to build a recommendation system using a Random Forest classifier.

## Preprocessing

In [1]:
# import all the libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.base import BaseEstimator, TransformerMixin

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_excel("train.xlsx")

In [3]:
df.head(2)

Unnamed: 0,no,nama,prodi,minat,bakat,mapel,nilai,ibu,ayah,penghasilan
0,1,DYAH AYU,akuntansi,Minat terhadap aktivitas yang berhubungan deng...,Bakat terhadap aktivitas yang berhubungan deng...,"SEJARAH, GEOGRAFI",89,PETANI,Meninggal,golongan 2
1,2,Muhammad isyak rizqi,teknik informatika,Minat terhadap aktivitas yang berhubungan deng...,Bakat terhadap aktivitas yang berhubungan deng...,komputer,96,WIRAUSAHA,Meninggal,golongan 2


* Notice that the values in the categorical columns are not standardized: casing and punctuation vary from row to row.

* We will also drop the `no` and `nama` columns, since they carry no useful information for the model.

In [4]:
# Drop no and nama
df.drop(["no", "nama"], axis=1, inplace=True)

* Next, we clean the text values.

### Text cleaning

In [5]:
# select the non-integer (text) columns
text_cols = df.select_dtypes(exclude="int64").columns
print(text_cols)

Index(['prodi', 'minat', 'bakat', 'mapel', 'ibu', 'ayah', 'penghasilan'], dtype='object')


In [6]:
# 1. lowercase all text with lower()
# 2. remove literal commas and dots with replace(); regex=False keeps "." from being read as a regex wildcard
# 3. trim leading and trailing whitespace with strip()

for col in text_cols:
  df[col] = (df[col].str.lower()
                    .str.replace(",", "", regex=False)
                    .str.replace(".", "", regex=False)
                    .str.strip())
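The same cleaning steps can be wrapped in a small helper and checked on a couple of made-up values. This is only a sketch: `clean_text` and the sample strings are illustrative names, not part of the original notebook.

```python
import pandas as pd


def clean_text(series: pd.Series) -> pd.Series:
    """Lowercase, drop literal commas and dots, and trim surrounding whitespace."""
    return (
        series.str.lower()
        .str.replace(",", "", regex=False)  # literal comma
        .str.replace(".", "", regex=False)  # literal dot, not a regex wildcard
        .str.strip()
    )


# Hypothetical sample values in the style of the `mapel` column
sample = pd.Series(["  SEJARAH, GEOGRAFI ", "Komputer."])
cleaned = clean_text(sample)
print(cleaned.tolist())
```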

In [7]:
df.head(2)

Unnamed: 0,prodi,minat,bakat,mapel,nilai,ibu,ayah,penghasilan
0,akuntansi,minat terhadap aktivitas yang berhubungan deng...,bakat terhadap aktivitas yang berhubungan deng...,sejarah geografi,89,petani,meninggal,golongan 2
1,teknik informatika,minat terhadap aktivitas yang berhubungan deng...,bakat terhadap aktivitas yang berhubungan deng...,komputer,96,wirausaha,meninggal,golongan 2


* The text values now look consistent.
* Next, let's label-encode the columns `prodi`, `ibu`, `ayah`, and `penghasilan`.

### LabelEncoder (Prodi, Ibu, Ayah, Penghasilan)

In [8]:
cols = ["prodi", "ibu", "ayah", "penghasilan"]
le = LabelEncoder()

for col in cols:
  le.fit(df[col])
  df[col] = le.transform(df[col])
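Because the loop above refits a single `LabelEncoder`, only the mapping for the last column is still available afterwards, so a predicted `prodi` code could not be turned back into a program name. One possible way around this, sketched on toy data, is to keep one encoder per column. The `encoders` dictionary and the `toy` frame are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the real DataFrame
toy = pd.DataFrame({
    "prodi": ["akuntansi", "teknik informatika", "akuntansi"],
    "ibu": ["petani", "wirausaha", "petani"],
})

# One fitted encoder per column, so every mapping can be reversed later
encoders = {}
for col in ["prodi", "ibu"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Map the integer codes back to the original program names
decoded = encoders["prodi"].inverse_transform(toy["prodi"])
print(toy["prodi"].tolist(), decoded.tolist())
```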

In [9]:
df.head(3)

Unnamed: 0,prodi,minat,bakat,mapel,nilai,ibu,ayah,penghasilan
0,2,minat terhadap aktivitas yang berhubungan deng...,bakat terhadap aktivitas yang berhubungan deng...,sejarah geografi,89,5,3,1
1,38,minat terhadap aktivitas yang berhubungan deng...,bakat terhadap aktivitas yang berhubungan deng...,komputer,96,8,3,1
2,38,minat terhadap aktivitas yang berhubungan deng...,bakat terhadap aktivitas yang berhubungan deng...,biologi bahasa inggris bahasa indonesia,96,2,4,1


* Good. Now let's vectorize the remaining text columns.

### TfIdf Vectorizer (Minat, Bakat, Mapel)

In [10]:
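cols = ["minat", "bakat", "mapel"]
vectorizer = TfidfVectorizer()
for col in cols:
  vectorizer.fit(df[col])
  # Note: toarray() returns a 2-D matrix (rows x vocabulary terms), yet it is
  # assigned to a single column here; the output below shows one value per row.
  df[col] = vectorizer.transform(df[col]).toarray()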

In [11]:
df.head(2)

Unnamed: 0,prodi,minat,bakat,mapel,nilai,ibu,ayah,penghasilan
0,2,0.148217,0.147235,0.0,89,5,3,1
1,38,0.177834,0.14795,0.0,96,8,3,1
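The table above shows a single float per text column, which suggests that each row's TF-IDF vector was reduced to one number. If the intent is to keep the whole vector, one option is to expand it into one column per vocabulary term. The sketch below does this on toy data; the `mapel_` prefix and the `toy` frame are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data standing in for the cleaned DataFrame
toy = pd.DataFrame({
    "mapel": ["sejarah geografi", "komputer", "biologi komputer"],
    "nilai": [89, 96, 96],
})

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy["mapel"])  # sparse, rows x vocabulary terms

# Order the terms by their column index in the matrix
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

tfidf = pd.DataFrame(
    matrix.toarray(),
    columns=[f"mapel_{term}" for term in terms],
    index=toy.index,
)

# Replace the raw text column with its TF-IDF columns
features = pd.concat([toy.drop(columns="mapel"), tfidf], axis=1)
print(features.shape)
```

With a real vocabulary this produces many columns, so the vectorizer's `max_features` parameter may be worth setting.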


* Now every column in the dataset is numeric (integers and floats), which means we can train our model.
* N.B.: scikit-learn estimators such as `RandomForestClassifier` expect numeric input, so text columns have to be encoded first.

## Modelling (RandomForest)

* Now let's train our RandomForest model

In [12]:
df.head(2)

Unnamed: 0,prodi,minat,bakat,mapel,nilai,ibu,ayah,penghasilan
0,2,0.148217,0.147235,0.0,89,5,3,1
1,38,0.177834,0.14795,0.0,96,8,3,1


* Let's split our dataset into features and target.

### Feature and Target

In [13]:
X = df.drop(["prodi"], axis=1)
y = df["prodi"]

* Let's check the class distribution of the target first.

In [14]:
y.value_counts()

38    24
2     12
42     9
9      8
36     7
34     7
19     6
17     6
13     4
7      3
23     3
4      3
37     3
32     3
22     2
44     2
30     2
41     2
39     2
25     2
31     2
27     2
16     2
33     1
1      1
3      1
28     1
21     1
5      1
29     1
10     1
35     1
8      1
43     1
11     1
6      1
12     1
18     1
14     1
26     1
40     1
0      1
15     1
20     1
24     1
Name: prodi, dtype: int64

* The target is extremely imbalanced: one class has 24 samples while many have only one. This may hurt the performance of our model.

* Let's use an oversampling technique to address this.

### Oversampling with imblearn

In [15]:
from imblearn.over_sampling import RandomOverSampler

sampler = RandomOverSampler()

X, y = sampler.fit_resample(X,y)

In [16]:
y.value_counts()

2     24
39    24
12    24
23    24
11    24
43    24
13    24
8     24
3     24
16    24
10    24
30    24
41    24
5     24
21    24
25    24
1     24
33    24
7     24
28    24
29    24
18    24
37    24
38    24
26    24
36    24
42    24
9     24
22    24
34    24
32    24
6     24
14    24
40    24
35    24
19    24
17    24
44    24
27    24
20    24
31    24
15    24
0     24
4     24
24    24
Name: prodi, dtype: int64

* Now all the target classes are perfectly balanced, with 24 samples each.

### Train and Accuracy

* Before we train our model, let's split the features and target into a train set and a test set.

In [17]:
# split into train set and test set with 30% test size
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,
                                                    random_state=42)
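One caveat with the split above: it happens after oversampling, so copies of the same original row can land in both the train and the test set. A common alternative is to split first and oversample only the training portion. The sketch below shows that order on toy data. It uses plain pandas sampling as a stand-in for `RandomOverSampler` (calling `fit_resample` on the training split only should work the same way), and `stratify` is used only because the toy classes are large enough to allow it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 30 rows of class 0, 10 rows of class 1
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"nilai": rng.integers(60, 100, size=40)})
y_toy = pd.Series([0] * 30 + [1] * 10, name="prodi")

# 1. Hold out the test set first so it contains only original rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# 2. Oversample the training rows only: draw each class with replacement
#    until it matches the size of the largest training class
target = y_tr.value_counts().max()
parts = [
    X_tr[y_tr == label].sample(n=target, replace=True, random_state=42)
    for label in sorted(y_tr.unique())
]
X_bal = pd.concat(parts)
y_bal = y_tr.loc[X_bal.index]

print(y_bal.value_counts().to_dict(), len(X_te))
```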

In [21]:
# fit the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

* Now let's see how accurate our model is

In [23]:
y_pred = model.predict(X_test)

print("RandomForest Accuracy Score: {:.2f}".format(accuracy_score(y_test, y_pred)))

RandomForest Accuracy Score: 0.97


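* The model reaches an accuracy score of 97%. Keep in mind that the oversampling was applied before the train/test split, so duplicated rows can appear in both sets and this score is likely optimistic.

* Let's save our model for the next development step.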

### Saving Model

In [24]:
# save the model with pickle; the with-block closes the file afterwards
import pickle

with open('model.pkl', 'wb') as f:
  pickle.dump(model, f)
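The saved model predicts integer codes, so whatever loads it also needs the label mapping to show a program name. One possible approach, sketched on toy data, is to pickle the model and the fitted `prodi` encoder together. The bundle keys, the file name `model_bundle.pkl`, and the toy values are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy training data: one numeric feature and two program names
labels = ["akuntansi", "teknik informatika", "akuntansi", "teknik informatika"]
X_toy = [[89], [96], [88], [97]]

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Save the model and its label encoder in a single file
bundle_path = Path(tempfile.mkdtemp()) / "model_bundle.pkl"
with open(bundle_path, "wb") as fh:
    pickle.dump({"model": clf, "prodi_encoder": encoder}, fh)

# Load the bundle and decode a prediction back to a program name
with open(bundle_path, "rb") as fh:
    bundle = pickle.load(fh)

pred = bundle["model"].predict([[96]])
restored = bundle["prodi_encoder"].inverse_transform(pred)
print(restored)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.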

### Load and Test Model

In [25]:
with open('model.pkl', 'rb') as f:
  loaded_model = pickle.load(f)

In [26]:
y_load_pred = loaded_model.predict(X_test)

In [28]:
print("Loaded Model Accuracy: {:.2f}".format(accuracy_score(y_test, y_load_pred)))

Loaded Model Accuracy: 0.97


* Yup, it's working: the loaded model reproduces the same accuracy as the original.
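For a recommendation use case, a ranked shortlist is often more useful than a single predicted class. A minimal sketch, assuming the classifier exposes `predict_proba` (as `RandomForestClassifier` does): take the `k` classes with the highest predicted probability for each row. The helper `recommend_top_k` and the random toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def recommend_top_k(model, X, k=3):
    """Return, for each row of X, the k class labels with the highest predicted probability."""
    proba = model.predict_proba(X)                   # shape: (n_rows, n_classes)
    top = np.argsort(proba, axis=1)[:, ::-1][:, :k]  # column indices, best first
    return model.classes_[top]                       # map indices to class labels


# Toy data: 60 rows, 4 numeric features, 5 classes
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 4))
y_toy = np.repeat(np.arange(5), 12)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)
top3 = recommend_top_k(clf, X_toy[:2], k=3)
print(top3)
```

The returned values are encoded class labels; with a stored `prodi` encoder they could be mapped back to program names through `inverse_transform`.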