# **Team:**
## **> Ahmed Samy**<br>
## **> Nader Elhadedy**

**<center><h1>Naive Bayes Classifier</h1></center>**
<hr>

The **Multinomial Naive Bayes** algorithm is a probabilistic learning method used mostly in Natural Language Processing (NLP). It is based on **Bayes' theorem** and predicts the tag of a text, such as an email or a newspaper article. It calculates the probability of each tag for a given sample and outputs the tag with the highest probability.

The **Naive Bayes classifier** is a family of algorithms that share one principle: each feature is assumed to be independent of every other feature, given the class. Under this assumption, the presence or absence of one feature does not affect the presence or absence of any other.

**Naive Bayes** is widely used for text analysis and for problems with multiple classes. Because it is built on Bayes' theorem, it helps to understand that theorem first.

**Bayes' theorem**, formulated by Thomas Bayes, gives the probability of an event based on prior knowledge of conditions related to that event. It is expressed by the following formula:

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$
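
As a worked illustration of the formula, the sketch below plugs in made-up numbers. Here A stands for "the job is in IT" and B for "the title contains the word engineer". The three probabilities and the helper name `bayes_posterior` are assumptions chosen for the example, not values measured on the dataset.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return prior_a * likelihood_b_given_a / evidence_b

# Illustrative numbers only (assumed, not measured):
p_it = 0.5                 # P(A): prior probability that a job is in IT
p_engineer_given_it = 0.3  # P(B|A): share of IT titles containing "engineer"
p_engineer = 0.2           # P(B): share of all titles containing "engineer"

posterior = bayes_posterior(p_it, p_engineer_given_it, p_engineer)
print(round(posterior, 2))  # 0.75
```

With these numbers, seeing the word "engineer" raises the probability of IT from the prior 0.5 to 0.75.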



> #  Implementing the algorithm

- ## Dataset: [SMS Spam Collection Dataset link](https://www.kaggle.com/uciml/sms-spam-collection-dataset)

In [92]:
import pandas as pd
# RandomOverSampler comes from the imbalanced-learn package (pip install imbalanced-learn)
from imblearn.over_sampling import RandomOverSampler

Jobs = pd.read_csv('Job titles and industries.csv', encoding = "ISO-8859-1", usecols=[0,1])

print(Jobs.shape)
Jobs.head()

ModuleNotFoundError: No module named 'imblearn'

In [55]:
Jobs.rename(columns={"job title": "job_title"}, inplace=True)

In [56]:
Jobs.head()

Unnamed: 0,job_title,industry
0,technical support and helpdesk supervisor - co...,IT
1,senior technical support engineer,IT
2,head of it services,IT
3,js front end engineer,IT
4,network and telephony controller,IT


In [88]:
Jobs['industry'].value_counts()


IT             4746
Marketing      2031
Education      1435
Accountancy     374
Name: industry, dtype: int64

# Data Splitting

In [58]:
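# Keep the features and the target as single-column DataFrames
y = Jobs[['industry']]
x = Jobs[['job_title']]
print(y.values)
# Oversample every minority class up to the size of the largest class (IT: 4746 rows)
oversample = RandomOverSampler(sampling_strategy={'Marketing': 4746, 'Education': 4746, 'Accountancy': 4746})
x, y = oversample.fit_resample(x, y)
y = pd.DataFrame(y)
# Keep the original column name so that later cells can still select 'industry'
y.columns = ['industry']
Jobs = pd.concat([pd.DataFrame(x), y], axis=1)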
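
To make the idea behind random oversampling concrete, here is a minimal standard-library sketch. It is not the imbalanced-learn implementation: the function `random_oversample` and the toy titles are assumptions for illustration. It duplicates randomly chosen rows of each smaller class until every class is as large as the biggest one.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(items) for items in by_class.values())
    out_rows, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing rows with replacement from the class itself
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for row in items + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data (assumed): three IT titles, one Education, one Accountancy
titles = ["web developer", "data modeller", "devops engineer", "maths teacher", "tax accountant"]
industries = ["IT", "IT", "IT", "Education", "Accountancy"]
new_titles, new_industries = random_oversample(titles, industries)
print(Counter(new_industries))
```

Oversampling only repeats existing rows, so it balances the class counts without adding new information. Applying it before the train/test split, as above, can place copies of the same row on both sides of the split.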

# Data Cleaning
When a new message comes in, our multinomial Naive Bayes algorithm classifies it by computing, for each class, the probability of that class given the words of the message, and then choosing the class with the highest value. Here "w1" is the first word, and w1, w2, ..., wn is the entire message.
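
The usual form of this rule for a class $c$ is $P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$. The sketch below is one possible way to implement it from scratch, under a few assumptions that are not in the notebook: add-one (Laplace) smoothing, whitespace tokenization, log-probabilities to avoid underflow, and toy training titles. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(texts, labels, alpha=1.0):
    """Estimate log P(c) and log P(w|c) with add-alpha (Laplace) smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict(text, log_prior, log_likelihood):
    """Return the class with the highest log P(c) + sum_i log P(w_i|c)."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data (assumed for illustration)
texts = ["senior software engineer", "network engineer", "maths teacher", "primary school teacher"]
labels = ["IT", "IT", "Education", "Education"]
log_prior, log_likelihood = train_multinomial_nb(texts, labels)
print(predict("software engineer", log_prior, log_likelihood))  # IT
```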



In [59]:
# Clean the job titles
Jobs['job_title'] = Jobs['job_title'].str.replace(r'\W', ' ', regex=True) # Removes punctuation
Jobs['job_title'] = Jobs['job_title'].str.lower()
Jobs.head(20)

Unnamed: 0,job_title,industry
0,technical support and helpdesk supervisor co...,IT
1,senior technical support engineer,IT
2,head of it services,IT
3,js front end engineer,IT
4,network and telephony controller,IT
5,privileged access management expert,IT
6,devops engineers x 3 global brand,IT
7,devops engineers x 3 global brand,IT
8,data modeller,IT
9,php web developer â 45 000 based in london,IT


In [60]:
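#training_set['job_title'] = training_set['job_title'].str.split()

# The titles are not split into words here, so iterating over each title
# yields single characters: the vocabulary is the set of distinct characters.
vocabulary = []
for job in Jobs['job_title']:
    for word in job:
        vocabulary.append(word)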

vocabulary = list(set(vocabulary))
len(vocabulary)


42
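
The small count of 42 comes from iterating over strings, which yields characters rather than words. The sketch below contrasts the two on a pair of titles taken from the table above. The word-level variant is an alternative shown for comparison, not what the notebook does, and the variable names are illustrative.

```python
titles = ["head of it services", "js front end engineer"]

# Iterating over a string yields single characters (including the space) ...
char_vocabulary = sorted({ch for title in titles for ch in title})

# ... while splitting on whitespace first yields whole words.
word_vocabulary = sorted({word for title in titles for word in title.split()})

print(len(char_vocabulary), char_vocabulary[:5])
print(len(word_vocabulary), word_vocabulary)
```

A word-level vocabulary is much larger than a character-level one on real data, so the resulting count table grows accordingly.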

In [61]:
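# One column per vocabulary entry, one row per job title; each cell counts
# how often that entry occurs in the title
word_counts_per_job = {unique_word: [0] * len(Jobs['job_title']) for unique_word in vocabulary}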

for index, job in enumerate(Jobs['job_title']):
    for word in job:
        word_counts_per_job[word][index] += 1

word_counts = pd.DataFrame(word_counts_per_job)
word_counts.head()

Unnamed: 0,ã,t,3,b,u,2,k,z,q,h,...,Unnamed: 12,6,0,4,j,¼,8,c,7,â
0,0,3,0,1,4,0,1,0,0,2,...,11,1,2,1,0,0,1,3,0,0
1,0,2,0,0,1,0,0,0,0,1,...,3,0,0,0,0,0,0,2,0,0
2,0,1,0,0,0,0,0,0,0,1,...,3,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,0,0,...,3,0,0,0,1,0,0,0,0,0
4,0,3,0,0,0,0,1,0,0,1,...,3,0,0,0,0,0,0,1,0,0
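
For comparison, the same kind of count table can be built at word level with `collections.Counter`. This is a sketch on three toy titles (assumed for illustration), not a replacement for the cell above.

```python
from collections import Counter

titles = ["devops engineers x 3 global brand", "data modeller", "senior data engineer"]
vocabulary = sorted({word for title in titles for word in title.split()})

# One row per title, one column per vocabulary word
count_rows = []
for title in titles:
    counts = Counter(title.split())  # missing words count as 0
    count_rows.append([counts[word] for word in vocabulary])

for title, row in zip(titles, count_rows):
    print(row, title)
```

Passing the result to `pd.DataFrame(count_rows, columns=vocabulary)` would give a table with the same layout as `word_counts`.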


In [62]:
Jobs_clean = pd.concat([Jobs, word_counts], axis=1)
Jobs_clean.head(20)

Unnamed: 0,job_title,industry,ã,t,3,b,u,2,k,z,...,Unnamed: 12,6,0,4,j,¼,8,c,7,â
0,technical support and helpdesk supervisor co...,IT,0,3,0,1,4,0,1,0,...,11,1,2,1,0,0,1,3,0,0
1,senior technical support engineer,IT,0,2,0,0,1,0,0,0,...,3,0,0,0,0,0,0,2,0,0
2,head of it services,IT,0,1,0,0,0,0,0,0,...,3,0,0,0,0,0,0,1,0,0
3,js front end engineer,IT,0,1,0,0,0,0,0,0,...,3,0,0,0,1,0,0,0,0,0
4,network and telephony controller,IT,0,3,0,0,0,0,1,0,...,3,0,0,0,0,0,0,1,0,0
5,privileged access management expert,IT,0,2,0,0,0,0,0,0,...,3,0,0,0,0,0,0,2,0,0
6,devops engineers x 3 global brand,IT,0,0,1,2,0,0,0,0,...,7,0,0,0,0,0,0,0,0,0
7,devops engineers x 3 global brand,IT,0,0,1,2,0,0,0,0,...,7,0,0,0,0,0,0,0,0,0
8,data modeller,IT,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
9,php web developer â 45 000 based in london,IT,0,0,0,2,0,0,0,0,...,8,0,3,1,0,0,0,0,0,1


In [87]:
import numpy as np
import pickle
import xgboost as xgb
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
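from sklearn.metrics import accuracy_score, classification_report

# Target: the industry label; features: the count columns built above
y = Jobs_clean['industry']
X = Jobs_clean.drop(columns=['job_title', 'industry'])
# Default split: 75% training, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Note: recent XGBoost releases expect integer class labels; if fit() rejects
# the string labels, encode them first (e.g. with sklearn's LabelEncoder)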
clf = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, 
              n_estimators=100, n_jobs=1, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
clf.fit(X_train, y_train)
pred_train = clf.predict(X_train)
pred_test = clf.predict(X_test)

# Accuracy on the training set, then on the held-out test set
print(accuracy_score(y_train, pred_train))
print(accuracy_score(y_test, pred_test))
print(classification_report(y_test, pred_test))

0.9799658332039136
0.8933395435491384
              precision    recall  f1-score   support

 Accountancy       0.89      0.63      0.74       103
   Education       0.88      0.81      0.84       363
          IT       0.91      0.97      0.94      1175
   Marketing       0.87      0.82      0.84       506

    accuracy                           0.89      2147
   macro avg       0.89      0.81      0.84      2147
weighted avg       0.89      0.89      0.89      2147
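
The two average rows of the report can be checked by hand. The sketch below recomputes the macro and weighted recall from the per-class recall and support values printed above. The macro average treats every class equally, while the weighted average weights each class by its support.

```python
# Per-class recall and support copied from the classification report above
recall = {"Accountancy": 0.63, "Education": 0.81, "IT": 0.97, "Marketing": 0.82}
support = {"Accountancy": 103, "Education": 363, "IT": 1175, "Marketing": 506}

total = sum(support.values())
macro_recall = sum(recall.values()) / len(recall)
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(total, round(macro_recall, 2), round(weighted_recall, 2))  # 2147 0.81 0.89
```

The gap between the two (0.81 versus 0.89) reflects the low recall of the smallest class, Accountancy, which counts fully in the macro average but only lightly in the weighted one.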



In [69]:
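# best_score_ is an attribute of a fitted GridSearchCV, not of the
# XGBClassifier fitted above, so this call raises the error shown below
print(clf.best_score_)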


AttributeError: 'XGBClassifier' object has no attribute 'best_score_'

In [93]:
import numpy as np
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
# load_boston was removed in scikit-learn 1.2; this cell needs an older release to run as written
from sklearn.datasets import load_iris, load_digits, load_boston

rng = np.random.RandomState(31337)

print("Zeros and Ones from the Digits dataset: binary classification")
digits = load_digits(n_class=2)
y = digits['target']
X = digits['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier(n_jobs=1).fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print(confusion_matrix(actuals, predictions))

print("Iris: multiclass classification")
iris = load_iris()
y = iris['target']
X = iris['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier(n_jobs=1).fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print(confusion_matrix(actuals, predictions))

print("Boston Housing: regression")
boston = load_boston()
y = boston['target']
X = boston['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor(n_jobs=1).fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print(mean_squared_error(actuals, predictions))

print("Parameter optimization")
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor(n_jobs=1)
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2, 4, 6],
                    'n_estimators': [50, 100, 200]}, verbose=1, n_jobs=1)
clf.fit(X, y)
print(clf.best_score_)
print(clf.best_params_)

# The sklearn API models are picklable
print("Pickling sklearn API models")
# must open in binary format to pickle
with open("best_boston.pkl", "wb") as f:
    pickle.dump(clf, f)
with open("best_boston.pkl", "rb") as f:
    clf2 = pickle.load(f)
print(np.allclose(clf.predict(X), clf2.predict(X)))

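# Early-stopping: stop adding trees once the validation AUC has not improved for 10 rounds
# (recent XGBoost releases expect early_stopping_rounds and eval_metric in the constructor instead of fit())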

X = digits['data']
y = digits['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier(n_jobs=1)
clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="auc",
        eval_set=[(X_test, y_test)])

Zeros and Ones from the Digits dataset: binary classification
[[87  0]
 [ 1 92]]
[[91  0]
 [ 2 87]]
Iris: multiclass classification
[[19  0  0]
 [ 0 31  3]
 [ 0  1 21]]
[[31  0  0]
 [ 0 16  0]
 [ 0  1 27]]
Boston Housing: regression
9.656600452186087
19.171199667160245
Parameter optimization
Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   11.3s finished


0.6839859272017424
{'max_depth': 2, 'n_estimators': 100}
Pickling sklearn API models
True
[0]	validation_0-auc:0.99950
[1]	validation_0-auc:0.99975
[2]	validation_0-auc:0.99975
[3]	validation_0-auc:0.99975
[4]	validation_0-auc:0.99975
[5]	validation_0-auc:0.99975
[6]	validation_0-auc:1.00000
[7]	validation_0-auc:1.00000
[8]	validation_0-auc:1.00000
[9]	validation_0-auc:1.00000
[10]	validation_0-auc:1.00000
[11]	validation_0-auc:1.00000
[12]	validation_0-auc:1.00000
[13]	validation_0-auc:1.00000
[14]	validation_0-auc:1.00000
[15]	validation_0-auc:1.00000
[16]	validation_0-auc:1.00000




XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=1, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
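
The log above stops at round 16 even though `n_estimators` is 100. The sketch below replays the logged AUC values through a simplified stopping rule to show why. It is an illustration of the idea, not XGBoost's implementation, and the function name is hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for a metric where higher is better."""
    best_round, best_score = 0, float("-inf")
    for round_idx, score in enumerate(scores):
        if score > best_score:
            # Strict improvement resets the counter
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(scores) - 1

# Validation AUC values from the log above (rounds 0 to 16)
auc_log = [0.99950] + [0.99975] * 5 + [1.00000] * 11
print(early_stopping_round(auc_log, patience=10))  # (6, 16)
```

The AUC last improved at round 6, and ten further rounds without improvement end the training at round 16.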

In [None]:
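# Pre-process a new job title the same way as the training data.
# The function name and the feature_columns argument are illustrative assumptions.
def preprocess_job_title(new_job, feature_columns=None):
    new_job = pd.DataFrame({"job_title": [new_job]})
    new_job['job_title'] = new_job['job_title'].str.replace(r'\W', ' ', regex=True) # Removes punctuation
    new_job['job_title'] = new_job['job_title'].str.lower()
    # Iterating over a title yields characters, matching the training features
    vocabulary = []
    for job in new_job['job_title']:
        for word in job:
            vocabulary.append(word)
    vocabulary = list(set(vocabulary))
    word_counts_per_job = {unique_word: [0] * len(new_job['job_title']) for unique_word in vocabulary}
    for index, job in enumerate(new_job['job_title']):
        for word in job:
            word_counts_per_job[word][index] += 1
    final_shape_input = pd.DataFrame(word_counts_per_job)
    if feature_columns is not None:
        # Align with the training columns: missing characters become 0, unseen ones are dropped
        final_shape_input = final_shape_input.reindex(columns=feature_columns, fill_value=0)
    return final_shape_input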