
##**Problem Statement :**

Classify businesses and companies across a standard taxonomy.

The dataset contains pre-classified companies along with text taken from each company's website.

The main objective is to cluster companies based on the description found on their website.

###**Overview of columns in the dataset :**

1. website: The website of the company/business
2. company_name: The company/business name
3. homepage_text: Visible homepage text
4. h1: The heading 1 tags from the HTML of the home page
5. h2: The heading 2 tags from the HTML of the home page
6. h3: The heading 3 tags from the HTML of the home page
7. nav_link_text: The visible titles of navigation links on the homepage (e.g. Home, Services, Product, About Us, Contact Us)
8. meta_keywords: The meta keywords in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
9. meta_description: The meta description in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)

In [2]:
# Loading the basic dependencies

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [30]:
# Loading data

file_path = "/content/drive/MyDrive/Almabetter/Capstone Project 2 (Company Classification)/DATA/data_company_classification.csv"
df = pd.read_csv(file_path)

**Basic Inspection**

In [4]:
# Checking head of df

df.head()

Unnamed: 0,website,company_name,homepage_text,h1,h2,h3,nav_link_text,meta_keywords,meta_description
0,bipelectric.com,bip dipietro electric inc,Electrici...,,,,,"electricians vero beach, vero beach electrical...","Providing quality, reliable full service resid..."
1,eliasmedical.com,elias medical,site map | en español Elias Medical h...,Offering Bakersfield family medical care from ...,Welcome to ELIAS MEDICAL#sep#Family Medical Pr...,Get To Know Elias Medical#sep#Family Medical P...,,Elias Medical bakersfield ca family doctor med...,For the best value in Bakersfield skin care tr...
2,koopsoverheaddoors.com,koops overhead doors,Home About Us Garage Door Repair & Servi...,,Customer Reviews#sep#Welcome to Koops Overhead...,,,"Koops Overhead Doors, Albany Garage Doors, Tro...","Koops Overhead Doors specializes in the sales,..."
3,midtowneyes.com,midtown eyecare,918-599-0202 Type Size...,,Welcome to our practice!,,,,We would like to welcome you to Midtown Eyecar...
4,reprosecurity.co.uk,repro security ltd,Simply fill out our form below...,,Welcome to REPRO SECURITY Ltd,,,,Repro Security provide a range of tailor made ...


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73974 entries, 0 to 73973
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   website           73974 non-null  object
 1   company_name      73974 non-null  object
 2   homepage_text     73305 non-null  object
 3   h1                46653 non-null  object
 4   h2                53212 non-null  object
 5   h3                44659 non-null  object
 6   nav_link_text     48050 non-null  object
 7   meta_keywords     23672 non-null  object
 8   meta_description  66886 non-null  object
dtypes: object(9)
memory usage: 5.1+ MB


In [7]:
df.shape

(73974, 9)

`I need to take care of the headers, so most of the work below focuses on h1, h2 & h3`

In [8]:
df.h1.describe()

count     46653
unique    44133
top        Home
freq        630
Name: h1, dtype: object

In [9]:
df.h2.describe()

count                                                 53212
unique                                                50732
top       Follow Us:and share our news...#sep#UK.COM Awa...
freq                                                    107
Name: h2, dtype: object

In [10]:
df.h3.describe()

count                                                 44659
unique                                                42130
top       Safe Payments By Adyen#sep#Fast Domain Transfe...
freq                                                    117
Name: h3, dtype: object
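
The most frequent `h2` and `h3` values above contain the marker `#sep#`. Judging from the sample rows, it appears to separate several headings of the same level within one cell; that reading is an assumption, not something the dataset documentation confirms here. A minimal sketch of how such cells could be split or flattened, using two `h2` values from the `df.head()` output:

```python
import pandas as pd

# Two h2 values copied from the df.head() output, plus one missing value.
h2 = pd.Series([
    "Customer Reviews#sep#Welcome to Koops Overhead Doors!",
    "Welcome to our practice!",
    None,
])

# Split each cell into a list of individual headings; missing cells stay missing.
h2_parts = h2.str.split("#sep#")

# Alternatively, replace the marker with a space to keep one flat string per row.
h2_clean = h2.str.replace("#sep#", " ", regex=False)

print(h2_parts)
print(h2_clean)
```

Splitting keeps the individual headings available for counting, while the flat string is more convenient as input to a text vectorizer.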

In [38]:
# Building a DataFrame with the header columns only (company_name is kept for reference).
# .copy() makes df_headers independent of df, so later column assignments do not touch the original.

df_headers = df[["company_name", "h1", "h2", "h3"]].copy()

In [15]:
df_headers.head()

Unnamed: 0,company_name,h1,h2,h3
0,bip dipietro electric inc,,,
1,elias medical,Offering Bakersfield family medical care from ...,Welcome to ELIAS MEDICAL#sep#Family Medical Pr...,Get To Know Elias Medical#sep#Family Medical P...
2,koops overhead doors,,Customer Reviews#sep#Welcome to Koops Overhead...,
3,midtown eyecare,,Welcome to our practice!,
4,repro security ltd,,Welcome to REPRO SECURITY Ltd,


In [17]:
df_headers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73974 entries, 0 to 73973
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  73974 non-null  object
 1   h1            46653 non-null  object
 2   h2            53212 non-null  object
 3   h3            44659 non-null  object
dtypes: object(4)
memory usage: 2.3+ MB


In [23]:
df_headers.shape

(73974, 4)

In [26]:
# Null values in header h1

print("Count of null values in h1:" , df_headers["h1"].isnull().sum())

# Percentage of null values for h1

print("Percentage of null values in h1: " , round(df_headers["h1"].isnull().sum() / df_headers.shape[0] * 100 , 2))

Count of null values in h1: 27321
Percentage of null values in h1:  36.93


In [27]:
# Null values in header h2

print("Count of null values in h2:" , df_headers["h2"].isnull().sum())

# Percentage of null values for h2

print("Percentage of null values in h2: " , round(df_headers["h2"].isnull().sum() / df_headers.shape[0] * 100 , 2))

Count of null values in h2: 20762
Percentage of null values in h2:  28.07


In [28]:
# Null values in header h3

print("Count of null values in h3:" , df_headers["h3"].isnull().sum())

# Percentage of null values for h3

print("Percentage of null values in h3: " , round(df_headers["h3"].isnull().sum() / df_headers.shape[0] * 100 , 2))

Count of null values in h3: 29315
Percentage of null values in h3:  39.63
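
The three cells above repeat the same calculation for each header column. A compact sketch of the same check for all columns at once is shown below; the `toy` frame and its values are made up for illustration and only stand in for `df_headers`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_headers; the values are invented for illustration.
toy = pd.DataFrame({
    "company_name": ["a", "b", "c", "d"],
    "h1": ["Home", np.nan, np.nan, "Welcome"],
    "h2": [np.nan, "About", np.nan, "Contact"],
    "h3": [np.nan, np.nan, np.nan, "Team"],
})

header_cols = ["h1", "h2", "h3"]

# isnull().sum() counts missing values per column;
# isnull().mean() is the missing fraction, scaled here to a percentage.
null_summary = pd.DataFrame({
    "null_count": toy[header_cols].isnull().sum(),
    "null_pct": (toy[header_cols].isnull().mean() * 100).round(2),
})
print(null_summary)

# Rows where all three headers are missing at the same time.
all_missing = int(toy[header_cols].isnull().all(axis=1).sum())
print("Rows with no header at all:", all_missing)
```

The last count is the more relevant one for the merged column: a row only ends up without any header text when `h1`, `h2` and `h3` are all missing.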


*Dropping every row with a null header would lose too much information!*

**Let's combine h1, h2 & h3 and then check for empty values.**

In [39]:
# Replacing null values with a single space (otherwise any NaN would turn the combined header into NaN)

df_headers = df_headers.fillna(" ")

In [40]:
df_headers.head()

Unnamed: 0,company_name,h1,h2,h3
0,bip dipietro electric inc,,,
1,elias medical,"Offering Bakersfield family medical care from pediatrics to geriatrics. Also offering skin care including Botox, Laser skin treatments and more.#sep#Elias Medical",Welcome to ELIAS MEDICAL#sep#Family Medical Practice#sep#SKIN CARE#sep#Schedule a Consultation\n661.663.0300,Get To Know Elias Medical#sep#Family Medical Practice#sep#Consultations#sep#Skin Care
2,koops overhead doors,,Customer Reviews#sep#Welcome to Koops Overhead Doors!,
3,midtown eyecare,,Welcome to our practice!,
4,repro security ltd,,Welcome to REPRO SECURITY Ltd,


In [41]:
# Merging h1 , h2 & h3

df_headers["headers"] = df_headers["h1"] + " " + df_headers["h2"] + " " + df_headers["h3"]
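
The merge above depends on the earlier fill step, because adding a string to `NaN` yields `NaN`. As a sketch of one possible alternative (not the approach used in this notebook), the fill, join and whitespace trimming can be done in a single chain. The `toy` frame is a hypothetical stand-in for `df_headers`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_headers before any filling.
toy = pd.DataFrame({
    "company_name": ["a", "b", "c"],
    "h1": ["Home", np.nan, np.nan],
    "h2": ["About#sep#Team", "Welcome", np.nan],
    "h3": [np.nan, np.nan, np.nan],
})

header_cols = ["h1", "h2", "h3"]

# Without filling, a single NaN operand makes the whole concatenation NaN.
naive = toy["h1"] + " " + toy["h2"] + " " + toy["h3"]
print(naive)

# Fill NaN with an empty string, join the three columns with spaces,
# then strip the padding left behind by the empty pieces.
toy["headers"] = (
    toy[header_cols]
    .fillna("")
    .agg(" ".join, axis=1)
    .str.strip()
)
print(toy[["company_name", "headers"]])
```

With this variant, rows that have no header at all end up as a true empty string, which makes them easier to detect later than a run of spaces.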

In [33]:
# None removes the column-width limit; passing -1 is deprecated in recent pandas versions
pd.set_option("display.max_colwidth", None)


In [42]:
df_headers.head()

Unnamed: 0,company_name,h1,h2,h3,headers
0,bip dipietro electric inc,,,,
1,elias medical,"Offering Bakersfield family medical care from pediatrics to geriatrics. Also offering skin care including Botox, Laser skin treatments and more.#sep#Elias Medical",Welcome to ELIAS MEDICAL#sep#Family Medical Practice#sep#SKIN CARE#sep#Schedule a Consultation\n661.663.0300,Get To Know Elias Medical#sep#Family Medical Practice#sep#Consultations#sep#Skin Care,"Offering Bakersfield family medical care from pediatrics to geriatrics. Also offering skin care including Botox, Laser skin treatments and more.#sep#Elias Medical Welcome to ELIAS MEDICAL#sep#Family Medical Practice#sep#SKIN CARE#sep#Schedule a Consultation\n661.663.0300 Get To Know Elias Medical#sep#Family Medical Practice#sep#Consultations#sep#Skin Care"
2,koops overhead doors,,Customer Reviews#sep#Welcome to Koops Overhead Doors!,,Customer Reviews#sep#Welcome to Koops Overhead Doors!
3,midtown eyecare,,Welcome to our practice!,,Welcome to our practice!
4,repro security ltd,,Welcome to REPRO SECURITY Ltd,,Welcome to REPRO SECURITY Ltd


In [44]:
# Now, dropping h1 , h2 & h3

df_headers.drop(["h1" , "h2" , "h3"] , axis = 1 , inplace=True)

In [45]:
df_headers.head()

Unnamed: 0,company_name,headers
0,bip dipietro electric inc,
1,elias medical,"Offering Bakersfield family medical care from pediatrics to geriatrics. Also offering skin care including Botox, Laser skin treatments and more.#sep#Elias Medical Welcome to ELIAS MEDICAL#sep#Family Medical Practice#sep#SKIN CARE#sep#Schedule a Consultation\n661.663.0300 Get To Know Elias Medical#sep#Family Medical Practice#sep#Consultations#sep#Skin Care"
2,koops overhead doors,Customer Reviews#sep#Welcome to Koops Overhead Doors!
3,midtown eyecare,Welcome to our practice!
4,repro security ltd,Welcome to REPRO SECURITY Ltd


In [46]:
# Checking empty values now!

df_headers.describe()

Unnamed: 0,company_name,headers
count,73974,73974.0
unique,73935,65081.0
top,longley concrete ltd,
freq,3,7461.0
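
In the `describe()` output, the most frequent `headers` value is blank and occurs 7461 times. `describe()` only groups identical strings, so values that differ in the amount or kind of whitespace would be counted separately. A hedged sketch of a more direct check, on a made-up `toy` frame, treats a row as blank when nothing remains after stripping whitespace:

```python
import pandas as pd

# Made-up stand-in for df_headers after merging; only the structure matters.
toy = pd.DataFrame({
    "company_name": ["a", "b", "c", "d"],
    "headers": ["     ", "Welcome to our practice!", " \n ", "Home     "],
})

# A row is blank when stripping whitespace leaves an empty string.
is_blank = toy["headers"].str.strip().eq("")

blank_count = int(is_blank.sum())
blank_pct = round(blank_count / len(toy) * 100, 2)
print("Blank headers:", blank_count, f"({blank_pct}%)")

# Keep only the rows that carry some header text.
toy_nonblank = toy.loc[~is_blank].reset_index(drop=True)
print(toy_nonblank)
```

Whether blank rows should be dropped or backfilled from another column such as `homepage_text` is a modelling decision that this sketch leaves open.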
