# Assignment 3, Task 1: Apply at least one model from another group to the dataset to classify the industry.

## Assignment Overview:
We begin with two datasets: `employer_raw_data_group_2.csv` and `industry_data.csv`.

* `employer_raw_data_group_2.csv` contains 20,000 rows. Each row has the name of an employer and a paragraph description of that employer.
* `industry_data.csv` contains 13 rows. Each row has the name of an industry and a paragraph description of that industry.

Our task is to use this data to develop a model that examines the description of a company and determines which industry it belongs to.

#### Modules
We start by importing all of the modules we'll be using along the way.

In [1]:
# Importing Modules
import numpy as np
import pandas as pd
import regex as re
import string
import unicodedata
import nltk
import random
nltk.download('wordnet')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Read Employer Data
Now, we get the employer dataset that we've been given and we put it into a Pandas dataframe.

In [2]:
# Reading in employer data
employer_data = pd.read_csv('employer_raw_data_group_2.csv')

#### Preprocessing
Here, we set up and use a preprocessing function to clean the descriptions of the employers, making them easier for the model to use.

In [3]:
# Preprocessing employer data

# Creating Preprocessing function
def get_preprocessing_function(
    use_lower: bool = True,
    use_alpha: bool = True,
    use_stemming: bool = False,
    use_nodates: bool = False,
    use_nourl: bool = True,
    use_stopwords: bool=False,
    use_lemmatizer: bool=False,
    use_nocity: bool=False
):

    # Setting up Stemmer and Lemmatizer
    stemmer = nltk.stem.SnowballStemmer("english")
    stop_words = []
    with open("stopwords.txt", "r") as f_in:
            stop_words = [i.strip().lower() for i in f_in.readlines()]
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    def alpha(text: str): # Removes non-alphabetic characters
        return re.sub("[^a-z]+", " ", text) if use_alpha else text

    def lower(text: str): # Makes all capital letters lowercase
        return text.lower() if use_lower else text
        
    def stemming(text: str): # Reduces each word to its stem
        if use_stemming:
            text = ' '.join(stemmer.stem(x) for x in text.split())
        return text
    
    def dates(text: str): # Removes month names and abbreviations
        dates = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december', 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 
    'sept', 'oct', 'nov', 'dec'] 
        return " ".join([word for word in text.split(" ") if word not in dates]) if use_nodates else text

    def url(text: str): # Removes urls
        url_pattern = re.compile(r'\S*\.com\b|https?://\S+|www\.\S+')
        return url_pattern.sub('', text) if use_nourl else text
    
    def remove_stopwords(text): # Removes commonly-used words (stopwords)
        return " ".join([word for word in text.split(" ") if word not in stop_words]) if use_stopwords else text
    
    def lemmatize(text: str):
        if use_lemmatizer:
            text=' '.join(lemmatizer.lemmatize(x) for x in text.split())
        return text
    
    def cityremover(text: str):
        city_state_pattern = re.compile("(?<![A-Za-z])[A-Z][a-z]+, [A-Z]{2}(?![A-Za-z])") # Matches strings like "Nashville, TN"
        return city_state_pattern.sub('city', text) if use_nocity else text

    def preprocess(text: str):
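        # Create list of steps. cityremover runs first because its pattern
        # relies on capital letters and the comma, which lower and alpha remove.
        steps = [cityremover, lower, url, alpha, dates, remove_stopwords, lemmatize, stemming]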
        for step in steps:
            text = step(text)
        return text
    
    return preprocess

# Instantiating preprocessing function
preprocess = get_preprocessing_function(
    use_lower= True,
    use_alpha= True,
    use_stemming= False,
    use_nodates= True,
    use_nourl= True,
    use_stopwords= True,
    use_lemmatizer= False,
    use_nocity=True
)

# Preprocessing employer data

# Commenting this out, since we have already gotten the preprocessed data in the csv
# employer_data['cleaned_description'] = employer_data['description'].apply(preprocess)
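
The order of the preprocessing steps matters, because some steps depend on information that others remove. As a minimal sketch, the example below applies the same city/state pattern to a made-up sentence, once before lowercasing and once after. The sentence and the variable names `city_first` and `lower_first` are assumptions for illustration only.

```python
import re

# Same pattern as in the preprocessing cell: matches strings like "Nashville, TN"
city_state_pattern = re.compile(r"(?<![A-Za-z])[A-Z][a-z]+, [A-Z]{2}(?![A-Za-z])")

# Made-up example sentence (not from the dataset)
raw = "Our office is in Nashville, TN and we are hiring."

# Replacing the city on the raw text works: capitals and the comma are intact
city_first = city_state_pattern.sub("city", raw).lower()

# Lowercasing first leaves nothing for the capital-letter pattern to match
lower_first = city_state_pattern.sub("city", raw.lower())

print(city_first)
print(lower_first)
```

In the first case the city is replaced by the placeholder token; in the second the original city and state abbreviation survive into the cleaned text.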

#### Write Cleaned Data to CSV
Now that we've preprocessed all 20,000 employer descriptions, we can write the cleaned descriptions back to the CSV file so that they'll be readily available for future use.

In [4]:
# Writing cleaned employer data to csv file
employer_data = employer_data[['employers', 'description', 'cleaned_description']]
# index=False keeps the row index from being saved as an extra unnamed column
employer_data.to_csv('employer_raw_data_group_2.csv', index=False)

In [5]:
employer_data.head()

Unnamed: 0,employers,description,cleaned_description
0,"moovel north america, llc",moovel’s scalable solutions allow transit agen...,moovel scalable solution allow transit agency ...
1,promon engenharia ltda.,Promon Engenharia provides complete solutions ...,promon engenharia provides complete solution e...
2,aller press a/s,Aller Media A/S Havneholmen 33 1561 København ...,aller medium havneholmen k benhavn v tlf email...
3,"teen librarian,rocky river public library",Megan Alabaugh Teen Librarian at Rocky River P...,megan alabaugh teen librarian rocky river publ...
4,executive director at vahera investments zimba...,Tadiwanashe Joy Pazvakavambwa | Zimbabwe | Exe...,tadiwanashe joy pazvakavambwa zimbabwe executi...


#### Read Industry Data
Next, we read in the other dataset, which provides a description of 13 different industries.

In [6]:
# Reading in industry data
industry_data = pd.read_csv('industry_data.csv')

#### Preprocess Industry Data
Now that we've read in this dataset, we'll use the same preprocessing function we already made to clean the descriptions of the industries.

In [7]:
# Preprocessing industry data
industry_data['cleaned_description'] = industry_data['description'].apply(preprocess)

#### Set up Vectorizer
We've imported and cleaned all relevant data. Now, it's time to start transforming it into a more usable format. Step one: vectorize each description.

We decided to use CountVectorizer rather than TFIDF. We tried both, and CountVectorizer seemed to yield better results. Since we're planning to use Latent Dirichlet Allocation (LDA), which models word counts, CountVectorizer is the more natural fit, although the difference we saw was admittedly not enormous.

In [8]:
# Setting up vectorizer

# We're using CountVectorizer because, from what I've read, it works better with LDA than tf_idf would
count_vectorizer = CountVectorizer(
    preprocessor= None, # We already preprocessed the data
    ngram_range=(1,1),
    tokenizer=lambda s: s.split(),
    min_df=2, 
    max_df=0.60  
)
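
To make the `min_df` and `max_df` settings concrete, here is a pure-Python sketch of document-frequency filtering on a toy corpus. It is only an illustration of the idea, not scikit-learn's implementation; the corpus and the names `doc_freq` and `vocabulary` are made up for this example.

```python
from collections import Counter

# Toy corpus, made up for illustration; each string is one cleaned description
docs = [
    "bank loan finance",
    "bank software cloud",
    "software cloud data",
    "bank hospital care",
    "bank care",
]
min_df = 2      # integer: keep terms that appear in at least 2 documents
max_df = 0.60   # float: drop terms that appear in more than 60% of documents

# Document frequency: number of documents containing each term at least once
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Convert the proportion into a document count, then filter on both bounds
max_count = max_df * len(docs)
vocabulary = sorted(
    term for term, df in doc_freq.items() if min_df <= df <= max_count
)
print(vocabulary)
```

Here "bank" is dropped for being too common (4 of 5 documents), and words that occur in only one document are dropped for being too rare.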

#### Fit Vectorizer to the Data
Here, we combine the two datasets and fit the vectorizer to the entire corpus.

In [9]:
# Fitting vectorizer to all data
all_descriptions = np.concatenate([industry_data['cleaned_description'].values, employer_data['cleaned_description'].values])
count_vectorizer.fit(all_descriptions);

  "The parameter 'token_pattern' will not be used"


#### Create Industry Matrix
Now, we can create a 13-row matrix, where each row is the vectorized version of the corresponding industry description.

In [10]:
# Creating industry matrix
industry_matrix = count_vectorizer.transform(industry_data['cleaned_description'].values)

#### Create Employer Matrix
As with the industry matrix, we also make a 20,000-row matrix, where each row is the vectorized version of the corresponding employer description.

In [11]:
# Creating employer matrix
employer_matrix = count_vectorizer.transform(employer_data['cleaned_description'].values)

#### Set up LDA
Now that we have vectorized versions of the company and industry descriptions, we will use Latent Dirichlet Allocation (LDA) to identify 30 topics discussed in these descriptions and determine how strongly each description matches each topic.

In [12]:
# Setting up LDA
lda = LatentDirichletAllocation(
    n_components=30
)

#### Fit LDA to the Data
As with the CountVectorizer, we will use the entire corpus, including both datasets, to fit LDA.

In [None]:
all_vectors = count_vectorizer.transform(all_descriptions)
lda.fit(all_vectors)

#### Transform and Classify Each Employer
Now, all that remains to be done is to use LDA to transform the industry and employer matrices, and then compare each employer to each industry. For a particular employer, whichever industry it most closely matches will be the model's prediction.

In [None]:
industry_topics = lda.transform(industry_matrix)

employer_topics = lda.transform(employer_matrix)

industry_names = industry_data["industry"].values

industry_prediction = []
for employer_vec in employer_topics:
    distances = []
    for industry_vec in industry_topics:
        # Measure how far the employer's topic vector is from the industry's
        distances.append(np.linalg.norm(industry_vec - employer_vec))
    # Pick the closest industry
    best_industry_index = np.argmin(distances)
    industry_prediction.append(industry_names[best_industry_index])


employer_data["industry_prediction"] = industry_prediction
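
The matching step can be seen in miniature with a few hand-written topic vectors. This is a sketch under stated assumptions: the three-topic distributions and industry names below are invented for illustration, and `math.dist` (Python 3.8+) stands in for `np.linalg.norm` of the difference.

```python
import math

# Hypothetical topic distributions over 3 topics (made up, not model output)
toy_industry_topics = {
    "Finance": [0.8, 0.1, 0.1],
    "Healthcare": [0.1, 0.8, 0.1],
    "Technology": [0.1, 0.1, 0.8],
}
toy_employer_topics = [0.2, 0.1, 0.7]

# Euclidean distance from the employer to every industry
toy_distances = {
    name: math.dist(toy_employer_topics, vec)
    for name, vec in toy_industry_topics.items()
}

# The prediction is the industry with the smallest distance
toy_prediction = min(toy_distances, key=toy_distances.get)
print(toy_prediction)
```

The employer puts most of its weight on the third topic, so it lands closest to the industry that does the same.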

#### Examine Predictions
With our predictions generated, we're finished!

In [None]:
predictions = employer_data[['employers', 'description', 'industry_prediction']]
predictions.sample(20)

Unnamed: 0,employers,description,industry_prediction
3801,gold canyon candles llc,Gold Canyon Lady may not have the same ring as...,Manufacturing
6820,universal logistics solutions canada,Speed. Accuracy. Consistency Universal Logisti...,Real Estate and Construction
7540,modern video film,Nine noble families fight for control over the...,Media and Entertainment
517,"signpost,",Customer communication software for Signpost h...,Media and Entertainment
8844,national petroleum services company,"National Petroleum Services Company – NPSC, is...",Energy and Utilities
3087,jbt corporation,JBT Corporation 2020 Annual Report. View Repor...,Manufacturing
16231,other - acuris global,Acuris Find your advantage. The Acuris Differe...,Manufacturing
6390,ncsoft (do not call this employer),Unrealistic expectations and workload. Employe...,Media and Entertainment
15606,cyberplex,CyberPlex Technology Training was established ...,Media and Entertainment
19375,escape above spa and yoga lounge,COVID update: Hudavi Wellness has updated thei...,Computer and Electronics


# Task 2: Using a word vector representation, how do you think this model performs compared to the others? Create an embedding representation of each industry using spaCy and find the closest industry using doc_1.similarity(industry_1)

In [None]:
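import spacy

# Get the word embedding model. Note that the small model ships without static
# word vectors, so its similarity scores are only approximate;
# en_core_web_md or en_core_web_lg include real vectors.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])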

In [None]:
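# Getting the top 15 words from each industry
# Assumption: tfidf_vec and the TF-IDF matrices are not defined earlier in this
# notebook, so we build them here with the same settings as the CountVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer(tokenizer=lambda s: s.split(), min_df=2, max_df=0.60)
tfidf_vec.fit(all_descriptions)
tfidf_industry_matrix = tfidf_vec.transform(industry_data['cleaned_description'].values)
tfidf_employer_matrix = tfidf_vec.transform(employer_data['cleaned_description'].values)

top_n = 15
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = np.array(tfidf_vec.get_feature_names_out())
industry_top_words = []
for i in range(tfidf_industry_matrix.shape[0]):
    # Get the words with the highest TF-IDF scores.
    # We negate the matrix row because argsort sorts in ascending order.
    s = np.argsort(np.asarray(-tfidf_industry_matrix[i, :].todense()).flatten())
    industry_top_words.append(" ".join(words[s[:top_n]]))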
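
As a small illustration of the top-words selection, the sketch below does the same thing in plain Python: rank term indices by score, highest first, and join the first few terms into one string. The terms and scores are invented for this example, and `sorted(..., reverse=True)` plays the role of `np.argsort` on the negated scores.

```python
# Hypothetical vocabulary and TF-IDF scores for one description (made up)
toy_terms = ["bank", "loan", "software", "hospital", "credit"]
toy_scores = [0.42, 0.31, 0.00, 0.05, 0.27]
toy_top_n = 3

# Indices of the terms ordered from highest score to lowest
order = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)

# Join the top-scoring terms into a single string, as the loop above does
toy_top_words = " ".join(toy_terms[i] for i in order[:toy_top_n])
print(toy_top_words)
```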

In [None]:
# Getting the top 15 words (top_n) from each employer
employer_top_words = []
for i in range(tfidf_employer_matrix.shape[0]):
    s = np.argsort(np.asarray(-tfidf_employer_matrix[i, :].todense()).flatten())
    employer_top_words.append(" ".join(words[s[:top_n]]))

In [None]:
# Finds the most similar industry for each employer using word-embedding similarity
industries = []
industry_embeddings = [nlp(industry) for industry in industry_top_words]
for employer in employer_top_words:
    employer_doc = nlp(employer)
    similarities = [employer_doc.similarity(industry) for industry in industry_embeddings]
    industries.append(industry_names[np.argmax(similarities)])
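
For intuition about what the similarity score measures: with models that include word vectors, spaCy's default `Doc.similarity` is the cosine similarity between the documents' vectors, and a document's vector defaults to the average of its token vectors. The sketch below reproduces that idea in plain Python. The 3-dimensional vectors, the word lists, and the helper names `mean_vector` and `cosine` are all made up for illustration and are not spaCy's API.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only
toy_vectors = {
    "bank": [0.9, 0.1, 0.0],
    "loan": [0.8, 0.2, 0.1],
    "hospital": [0.1, 0.9, 0.1],
    "nurse": [0.0, 0.8, 0.2],
}

def mean_vector(words):
    """Average the vectors of the known words, dimension by dimension."""
    vecs = [toy_vectors[w] for w in words if w in toy_vectors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Top words per industry and for one employer (made up)
toy_industry_words = {"Finance": ["bank", "loan"], "Healthcare": ["hospital", "nurse"]}
toy_employer_words = ["loan", "bank", "nurse"]

employer_vec = mean_vector(toy_employer_words)
toy_similarities = {
    name: cosine(employer_vec, mean_vector(words))
    for name, words in toy_industry_words.items()
}

# The prediction is the industry with the highest similarity
toy_best_industry = max(toy_similarities, key=toy_similarities.get)
print(toy_best_industry)
```

Unlike the distance-based matching in Task 1, higher is better here, which is why the notebook uses `np.argmax` rather than `np.argmin`.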

In [None]:
employer_data["industry"] = industries

In [None]:
employer_data

In [None]:
employer_data.to_csv('POSSIBLE_RESULTS.csv') # Check this when it finishes running to see the results
