<a href="https://colab.research.google.com/github/Seifollahi/industry_by_company_name/blob/master/app.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Industry By Company Name

This project develops a machine learning model that predicts a company's NAICS code from its name. So far it reaches an accuracy of about 0.41 on a held-out 10% test split, using a linear classifier trained with stochastic gradient descent (SGD).

In [0]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [0]:
url = "https://github.com/Seifollahi/industry_by_company_name/blob/master/data/training-testing/train-test.csv?raw=true"

df = pd.read_csv(url)

In [0]:
df.head()

                              bus_name  ... tollfree
0  jang kang korean chinese restaurant  ...      NaN
1                   sheridan nurseries  ...      NaN
2                  shinhan bank canada  ...      NaN
3                         shell canada  ...      NaN
4     cooksville hair and beauty salon  ...      NaN

[5 rows x 13 columns]

In [0]:
df.columns

Index(['bus_name', 'emp_range', 'naics_6', 'naics_desc', 'street_no',
       'street_name', 'postcode', 'unit', 'phone', 'fax', 'email', 'website',
       'tollfree'],
      dtype='object')

In [0]:
df_selected = df[["bus_name","naics_6"]]

In [0]:
X, Y = df_selected.bus_name, df_selected.naics_6.astype(str)

In [0]:
features = X.to_list()

## Lemmatizing the data

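In this section we use the TextBlob module to lemmatize the input text, so that inflected forms of a word (for example "nurseries" and "nursery") map to the same token.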

In [0]:
def split_into_lemmas(text):
    """Tokenize a company name with TextBlob and return the lemma of each word."""
    words = TextBlob(text).words
    return [word.lemma for word in words]

# Build the bag-of-words vocabulary, using the lemmatizer as a custom analyzer.
bow = CountVectorizer(analyzer=split_into_lemmas).fit(features)
print("Length of Vocabulary : " + str(len(bow.vocabulary_)))

Length of Vocabulary : 11887
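
To picture what the bag-of-words step produces, here is a minimal plain-Python sketch. `TOY_LEMMAS` and `toy_analyzer` are hypothetical stand-ins for the TextBlob lemmatizer (a tiny lookup table, not a real lemmatizer), and the third name is made up for illustration.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer such as TextBlob.
TOY_LEMMAS = {
    "nurseries": "nursery",
    "technologies": "technology",
    "services": "service",
}

def toy_analyzer(text):
    """Split on whitespace and map each word to its (toy) lemma."""
    return [TOY_LEMMAS.get(word, word) for word in text.lower().split()]

names = ["sheridan nurseries", "navair technologies", "nursery services"]

# Vocabulary: every distinct lemma gets a column index (sorted for readability).
tokens = sorted({token for name in names for token in toy_analyzer(name)})
vocabulary = {token: index for index, token in enumerate(tokens)}

# Document-term matrix: one row per name, one count per vocabulary entry.
rows = []
for name in names:
    counts = Counter(toy_analyzer(name))
    rows.append([counts[token] for token in tokens])

print(vocabulary)
for name, row in zip(names, rows):
    print(row, name)
```

Because "nurseries" and "nursery" share a lemma, they land in the same column, which keeps the vocabulary smaller than a plain word split would.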


### Term frequency times inverse document frequency (TF-IDF)

TF-IDF is used to reduce the weight of very common words such as "the", "a", and "an", which appear in many names and therefore say little about the industry.

In [0]:
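# Convert every company name to a sparse vector of lemma counts.
bow_list = bow.transform(features)

# Learn the inverse document frequencies from the counts.
tfidf_transformer = TfidfTransformer().fit(bow_list)

# Reweight the counts: one row per company name, one column per lemma.
bow_tfidf = tfidf_transformer.transform(bow_list)
print("Dimension of the Document-Term matrix : " + str(bow_tfidf.shape))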

Dimension of the Document-Term matrix : (16344, 11887)
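
The down-weighting can be checked by hand. The sketch below uses plain Python and the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is what `TfidfTransformer` applies with its default settings (it then also L2-normalizes each row, which is left out here). The three company names are made up for illustration.

```python
import math
from collections import Counter

# Toy company names (hypothetical, not taken from the dataset).
names = [
    "maple bank inc",
    "maple bakery inc",
    "harbour trucking inc",
]

def smoothed_idf(documents):
    """Return {term: idf} with idf = ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }

idf = smoothed_idf(names)
for term in sorted(idf, key=idf.get):
    print(f"{term:10s} {idf[term]:.3f}")
```

A term that occurs in every name ("inc" here) gets the lowest possible weight, while a term that occurs in only one name gets the highest.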


## Machine Learning

The scikit-learn library is used to build the model and predict the labels from the features.

In [0]:
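# Hold out 10% of the names for testing. No random_state is set, so the split
# (and the accuracy reported below) changes from run to run.
train, test, label_train, label_test = train_test_split(X, Y, test_size=0.1)

print("Number of samples in Training Dataset : " + str(len(train)))
print("Number of samples in Testing Dataset : " + str(len(test)))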


Number of samples in Training Dataset : 14709
Number of samples in Testing Dataset : 1635
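
The two counts can be reproduced by hand. The sketch below assumes that the test share is rounded up to a whole number of samples and that the remainder goes to training, which matches the output above. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(16344, 0.1)
print("train:", n_train, "test:", n_test)
```

To get the same rows in each set on every run, `train_test_split` also accepts a `random_state` argument, for example `random_state=42`.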


In [0]:
train

11365                                     skyway jacks
11250    options mississauga print and office services
12435                      vista heights public school
9874                      wood studio of matthew leite
7198                      t s t truckload express inc.
                             ...                      
874                                             pho le
11542                                      canada post
16292        b i logistics services inc. ( b i l s i )
8053                               navair technologies
9052                           royal montessori school
Name: bus_name, Length: 14709, dtype: object

### SGD

In [0]:
from sklearn.linear_model import SGDClassifier

# Chain lemmatized bag-of-words, TF-IDF weighting, and a linear classifier.
# With loss='hinge', SGDClassifier fits a linear SVM.
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),
    ('tfidf', TfidfTransformer()),
    ('classifier', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, random_state=42)),
])

pipeline = pipeline.fit(train, label_train)

predicted = pipeline.predict(test)

print("Accuracy Score SGDClassifier : " + str(accuracy_score(label_test, predicted)))



Accuracy Score SGDClassifier : 0.41162079510703364
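
Accuracy here is the share of test names whose predicted code exactly matches the true code. With many possible classes, the number is easier to judge next to a simple baseline that always predicts the most frequent training label. The sketch below shows both computations in plain Python. The labels are toy placeholders, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return accuracy(y_test, [most_common_label] * len(y_test))

# Toy labels for illustration only (not real results).
y_train = ["A", "A", "B", "C"]
y_test = ["A", "B", "A", "D"]
y_pred = ["A", "B", "A", "A"]

model_score = accuracy(y_test, y_pred)
baseline_score = majority_baseline(y_train, y_test)
print("model:", model_score, "baseline:", baseline_score)
```

Running the same comparison on `label_train`, `label_test`, and `predicted` would show how much of the 0.41 comes from the model rather than from class imbalance.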


In [0]:
# Optional: uncomment to compare the actual and predicted labels.

# print("Actual Result : \n")
# print(", ".join(label_test))

# print("\n\n")

# print("Predicted Result : \n")
# print(str(predicted) + "\n\n")


In [0]:
pred2 = pipeline.predict(['friend corns'])

In [0]:
pred2

array(['561310.0'], dtype='<U8')
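
The trailing `.0` in the predicted label suggests that `naics_6` was read as a floating-point column (pandas does this when a numeric column contains missing values) before `astype(str)` was applied. The model still works, but the labels are not clean six-digit codes. A minimal sketch of a cleanup step follows. `normalize_naics` is a hypothetical helper, and returning `None` for missing values is an assumption about how such rows should be handled.

```python
import math

def normalize_naics(label):
    """Turn values such as '561310.0' or 561310.0 into '561310'.

    Returns None when the value is missing or not numeric.
    """
    try:
        number = float(label)
    except (TypeError, ValueError):
        return None
    if math.isnan(number) or math.isinf(number):
        return None
    return str(int(number))

print(normalize_naics("561310.0"))
print(normalize_naics("nan"))
```

The same function could be applied to the label column before training, for example with `df_selected.naics_6.map(normalize_naics)`, followed by dropping the rows where the result is `None`.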

## Exporting the model
The fitted model is exported as a pickle file for future use.

In [0]:
import pickle

# Serialize the fitted pipeline (vectorizer, TF-IDF transformer, and classifier) to disk.
with open('industry_by_company_name', 'wb') as picklefile:
    pickle.dump(pipeline, picklefile)
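
Loading the file back is the mirror image of the export. The sketch below shows the round trip with a plain dictionary as a stand-in for the pipeline, so it runs without the trained model. `load_model` is a hypothetical helper. Two caveats apply to the real file: pickle stores functions by reference, so `split_into_lemmas` must be defined or importable in the session that loads the pipeline, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

def load_model(path):
    """Unpickle and return whatever object was saved at `path`."""
    with open(path, "rb") as picklefile:
        return pickle.load(picklefile)

# Round trip with a stand-in object (a plain dict, not the real pipeline).
stand_in = {"model": "placeholder", "version": 1}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "industry_by_company_name"
    with open(path, "wb") as picklefile:
        pickle.dump(stand_in, picklefile)
    restored = load_model(path)

print(restored == stand_in)
```

With the real file, the call would be `pipeline = load_model('industry_by_company_name')`, after which `pipeline.predict([...])` works as in the cells above.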