# BIA-660: Web Scraping
**Final Project Part 2: Classification**

**Date:** 

**Team**
- Jarrin Sacayanan
- Sabah Ahmed
- Million Mehari

## Instructions

### Steps
1. Collect at least 5,000 Job Ads for Data Scientists from Indeed.com
2. Collect at least 5,000 Job Ads for Software Engineers from Indeed.com
3. Get the HTML of the job description (as shown on the right side of the screen after you click on an Ad) for each Ad.
4. Extract the text from the HTML and create a CSV with 1 Ad per line and 2 columns: `<text>` and `<job title>`
5. Train a classification model that can predict whether a given Ad is for a Data Scientist or a Software Engineer
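
Step 4 can be sketched with the standard library alone. The snippet below is a minimal illustration, not the team's scraping code: `TextExtractor`, `html_to_text`, `write_ads_csv`, and the two sample ads are hypothetical names and data made up for the example.

```python
import csv
import io
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def write_ads_csv(ads, handle):
    """Write (html, job_title) pairs to an open text file, one Ad per row."""
    writer = csv.writer(handle)
    writer.writerow(["text", "job title"])
    for html, job_title in ads:
        writer.writerow([html_to_text(html), job_title])


# Made-up sample ads, for illustration only
sample_ads = [
    ("<div><h2>Data Scientist</h2><p>Build models in Python.</p></div>", "Data Scientist"),
    ("<div><p>Ship backend services.</p><script>track()</script></div>", "Software Engineer"),
]
buffer = io.StringIO()
write_ads_csv(sample_ads, buffer)
csv_output = buffer.getvalue()
```

Using `csv.writer` rather than joining strings by hand means descriptions that contain commas, quotes, or newlines are quoted correctly.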

### Notes
- Your trained model will be evaluated on a separate test set that you will not have access to before the deadline
- The deliverables include:
    - The scraping script(s) in .ipynb format
    - The classification script as a separate .ipynb Notebook
    - Instructions on how to run the 2 Notebooks
    - The CSV from step 4
- Your classification script should be able to read a test CSV that includes 1 job description per line (no labels). It should then produce a new file that includes the predicted label for each line in the test file.
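
The last note could be handled roughly as follows. This is a sketch under assumptions: the test CSV is taken to have no header row and a single column, `predict_file` and the output column names are hypothetical, and a tiny `LogisticRegression` pipeline trained on made-up sentences stands in for the real trained model.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def predict_file(model, test_csv, output_csv):
    """Read one job description per line (no header, no labels) and write predictions."""
    test_df = pd.read_csv(test_csv, header=None, names=["text"])
    test_df["predicted_label"] = model.predict(test_df["text"])
    test_df.to_csv(output_csv, index=False)
    return test_df


# Toy training data, for illustration only
train_text = [
    "machine learning statistics python models",
    "statistics modeling experiments analysis",
    "backend services java deployment",
    "java microservices api deployment",
]
train_labels = ["Data Scientist", "Data Scientist", "Software Engineer", "Software Engineer"]

# Bundling the vectorizer with the classifier keeps the fitted vocabulary
# attached to the model at prediction time
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_text, train_labels)

# In-memory stand-ins for the test file and the output file
test_csv = io.StringIO("statistics models and experiments\njava api services\n")
output_csv = io.StringIO()
result = predict_file(model, test_csv, output_csv)
```

With real files, `test_csv` and `output_csv` would simply be file paths. If the actual test file has a header row, `header=None` would need to change.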

In [32]:
# Basic imports
import pandas as pd
import numpy as np
import random
import nltk
from nltk.corpus import stopwords

# Misc sklearn imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Classification model imports
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Load the Scraped Data

In [8]:
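def load_data(file_name):
    """Read a scraped CSV and return it as a DataFrame, or None if loading fails."""
    new_df = None
    try:
        new_df = pd.read_csv(file_name)
        # Drop the first column (assumed to be the row index saved with the CSV)
        new_df = new_df.iloc[:, 1:]
    except Exception as e:
        print(f'Something went wrong...{e}')

    return new_df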

In [11]:
data_scientist_df = load_data('ds_scrape_1.csv')
software_engineer_df = load_data('se_scrape_1.csv')

In [21]:
data_scientist_df.head()

| | Title | Description | Label |
|---|---|---|---|
| 0 | Senior Data Analyst | Responsibilities\nThe Semel Institute for Neur... | Data Scientist |
| 1 | Video Revenue and Subscription Data Scientist | Summary\nPosted: Apr 5, 2022\nRole Number:2003... | Data Scientist |
| 2 | Marketing Data Scientist | Title: Marketing Data Scientist\nDuration: 12 ... | Data Scientist |
| 3 | Data Scientist | Data Scientist – 100% Remote\nSalary: $120K - ... | Data Scientist |
| 4 | Data Scientist - TikTok US - Tech Services | TikTok is the leading destination for short-fo... | Data Scientist |


In [22]:
software_engineer_df.head()

| | Title | Description | Label |
|---|---|---|---|
| 0 | Entry Level Software Engineer | We are seeking creative and talented individua... | Software Engineer |
| 1 | Software Engineer - Availability, Cash App | Company Description\n\nIt all started with an ... | Software Engineer |
| 2 | Software Engineer II | Disney Streaming’s Growthlife QA team is seeki... | Software Engineer |
| 3 | Remote Entry Level Software Engineer | SkillStorm is actively seeking Full-time Entry... | Software Engineer |
| 4 | Software Development Engineer | Come build the future as a software developmen... | Software Engineer |


In [16]:
# Combine the two datasets, choosing at random which one comes first
dfs = [data_scientist_df, software_engineer_df]
random.shuffle(dfs)
combined_df = pd.concat(dfs, axis=0)
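
`random.shuffle(dfs)` only decides which of the two frames is concatenated first. Rows inside each frame keep their order, and `pd.concat` keeps each frame's original index labels, so labels repeat. `train_test_split` shuffles rows by default, so the later split is still mixed. If a row-level shuffle of the combined frame itself is wanted, one option is `DataFrame.sample(frac=1)`. A minimal sketch on made-up frames (`ds_df` and `se_df` are stand-ins, not the scraped data):

```python
import pandas as pd

# Made-up stand-ins for the two scraped datasets
ds_df = pd.DataFrame({"Description": ["a", "b", "c"], "Label": ["Data Scientist"] * 3})
se_df = pd.DataFrame({"Description": ["x", "y", "z"], "Label": ["Software Engineer"] * 3})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat([ds_df, se_df], axis=0, ignore_index=True)

# Shuffle every row reproducibly, then renumber the index
shuffled = combined.sample(frac=1, random_state=42).reset_index(drop=True)
```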

In [24]:
combined_df.shape

(2690, 3)

In [27]:
# Split into features and target columns
x = combined_df['Description']
y = combined_df['Label']

# Split the data into training and testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [33]:
# Get the stopwords
nltk.download('stopwords')

# Clean the data
counter = CountVectorizer(stop_words=stopwords.words('english'))
counter.fit(x_train)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jarri\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


CountVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...])

In [40]:
# Vectorize the job descriptions
counts_train = counter.transform(x_train)
counts_test = counter.transform(x_test)
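
Fitting the vectorizer on the training descriptions only means the vocabulary is learned from training text, and words that appear only in test text are ignored at transform time. A small sketch of that behaviour, using made-up sentences and a short hand-written stop list as a stand-in for the NLTK list:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only
train_docs = ["python and statistics", "java and python services"]
test_docs = ["statistics with kubernetes"]

# A two-word stop list stands in for stopwords.words('english')
vectorizer = CountVectorizer(stop_words=["and", "with"])
vectorizer.fit(train_docs)

# Vocabulary learned from the training documents only
vocabulary = sorted(vectorizer.vocabulary_)

# "kubernetes" never appeared in training, so it contributes no column
test_counts = vectorizer.transform(test_docs).toarray()
```

Here `vocabulary` holds the four training words that survive the stop list, and the single test row has a count only in the `statistics` column.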

In [42]:
# Set up predictor classes
KNN_classifier = KNeighborsClassifier()
RF_classifier = RandomForestClassifier()
SVM_classifier = svm.SVC()
NN_classifier = MLPClassifier()

In [44]:
# Build the parameter grid
KNN_grid = [{'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19], 'weights':['uniform','distance']}]

# Build a grid search to find the best parameters
gridsearchKNN = GridSearchCV(KNN_classifier, KNN_grid, cv=5)

# Run the grid search
gridsearchKNN.fit(counts_train, y_train)

# Get the best parameters
KNN_best = gridsearchKNN.best_estimator_
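
After fitting, a `GridSearchCV` object also exposes the winning parameter combination and its mean cross-validated score, which is useful for checking what was actually selected. A sketch on synthetic data (the dataset and the smaller grid are assumptions made for the example, not the notebook's data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=120, n_features=8, random_state=42)

param_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_        # dict with the selected values
best_cv_accuracy = search.best_score_    # mean accuracy across the 5 folds
n_candidates = len(search.cv_results_["params"])  # 3 x 2 combinations tried
```

The same attributes are available on `gridsearchKNN`, `gridsearchRF`, `gridsearchSVM`, and `gridsearchNN`.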

In [45]:
# Build a parameter grid
RF_grid = [{'n_estimators': [5, 10, 20, 50, 100, 150, 200], 'criterion': ['gini', 'entropy'], 'random_state': [42]}]

# Build a grid search to find the best parameters
gridsearchRF = GridSearchCV(RF_classifier, RF_grid, cv=5)

# Run the grid search
gridsearchRF.fit(counts_train, y_train)

# Get the best params
RF_best = gridsearchRF.best_estimator_

In [46]:
# Build a parameter grid
SVM_grid = [{'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'gamma': ['scale', 'auto']}]

# Build a grid search to find the best parameters
gridsearchSVM = GridSearchCV(SVM_classifier, SVM_grid, cv=5)

# Run the grid search
gridsearchSVM.fit(counts_train, y_train)

# Get the best params
SVM_best = gridsearchSVM.best_estimator_

In [48]:
# Build a parameter grid
# Each tuple (i, j) describes two hidden layers with i and j neurons.
# Both ranges start at 1 because MLPClassifier rejects a layer size of 0.
neuron_count = list(range(1, 8))
hidden_layers = list(range(1, 5))
hidden_combos = []
for i in neuron_count:
    for j in hidden_layers:
        hidden_combos.append((i, j))

NN_grid = {'hidden_layer_sizes': hidden_combos, 'activation':['identity'], 'solver': ['lbfgs', 'sgd'], 'random_state': [42], 'max_iter': [300, 500, 700]}

# Build a grid search to find the best parameters
gridsearchNN = GridSearchCV(NN_classifier, NN_grid, cv=5)

# Run the grid search
gridsearchNN.fit(counts_train, y_train)

# Get the best params
NN_best = gridsearchNN.best_estimator_
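
In `MLPClassifier`, the length of the `hidden_layer_sizes` tuple is the number of hidden layers and each entry is the width of one layer. If the intent is to search over depth as well as width, one possible way to build the candidates is shown below. The widths, depths, and synthetic data are assumptions made for the example.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

widths = [2, 4, 8]   # neurons per hidden layer
depths = [1, 2]      # number of hidden layers

# (w,) * d repeats the width d times, e.g. (4,) * 2 == (4, 4)
layer_grid = [(w,) * d for w, d in product(widths, depths)]

# Fit one candidate on synthetic data to inspect the resulting network
X, y = make_classification(n_samples=100, n_features=6, random_state=42)
clf = MLPClassifier(
    hidden_layer_sizes=layer_grid[1],  # (2, 2): two hidden layers of 2 neurons
    activation="identity",
    solver="lbfgs",
    max_iter=300,
    random_state=42,
)
clf.fit(X, y)

# One weight matrix per connection: input->h1, h1->h2, h2->output
weight_shapes = [c.shape for c in clf.coefs_]
```

A list such as `layer_grid` can be passed as the `'hidden_layer_sizes'` entry of a parameter grid.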

In [49]:
predictors = [
    ('knn', KNN_best), 
    ('rf', RF_best),
    ('svm', SVM_best),
    ('nn', NN_best)
]

# Combine the tuned models; VotingClassifier uses majority (hard) voting by default
VT = VotingClassifier(predictors)

In [52]:
# Fit the voting classifier
VT.fit(counts_train, y_train)

# Use the voting classifier to predict
predicted = VT.predict(counts_test)

# Print the accuracy (accuracy_score expects y_true first, then y_pred)
print(accuracy_score(y_test, predicted))

0.9925650557620818
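
To see whether voting helps, the ensemble's accuracy can be compared with that of each member on the same held-out split. The sketch below uses synthetic data and untuned estimators as stand-ins for the notebook's tuned models, so its numbers say nothing about the job-ad results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic data with integer labels 0/1, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

estimators = [
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svm", SVC()),
]
voter = VotingClassifier(estimators)  # voting="hard" by default
voter.fit(X_train, y_train)

# Accuracy of each fitted member, then of the ensemble itself
scores = {
    name: accuracy_score(y_test, fitted.predict(X_test))
    for name, fitted in voter.named_estimators_.items()
}
scores["voting"] = accuracy_score(y_test, voter.predict(X_test))
```

One caveat: `VotingClassifier` fits its members on label-encoded targets, so with string labels such as `'Data Scientist'` the members' raw predictions are integer codes and would need `voter.le_.inverse_transform` before being compared with `y_test`.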
