# Project Category Identification NLP Project <a id='Data Overview'></a>

## Table of Contents <a id='2.1_Contents'></a>
* [0. Background](#0)
* [1. Data Import - Building Dataset](#1)
* [2. Target Location Selection](#2)
* [3. Data Understanding](#3)
    * [3.1 Missing Values](#3.1)
    * [3.2 Meter Type Observation](#3.2)
    * [3.3 Building Area Observation](#3.3)
    * [3.4 Building Category Observation](#3.4)
    * [3.5 Building Energy Use Intensity](#3.5)
    * [3.6 Other Building Parameters](#3.6)
* [4.0 Save Data](#4.0)

## 0. Background

I built a web scraping tool that scrapes bidding and tendering information from public-sector bodies in Ontario. These include regions such as York Region, Peel Region, and Halton Region, and cities such as the City of Mississauga.

The intent of the project is to use data to tell a story about competitiveness in the public sector. The collected information is used to conduct the following analyses:

* **Public Sector Market Understanding**: an overview of the public-sector markets (services, construction, and goods) and how they change over time.

* **Competitive Information**: an overview of how private companies perform in different public sectors, including their win rate compared with their competitors. The private companies include construction companies such as Maple Reinders Constructors Ltd. and Kenaidan Contracting Ltd., consulting companies such as WSP Canada Inc. and Hatch Ltd., and goods suppliers.

* **Competitiveness by Category**: private companies working in the public sector have overlapping skill sets, but they may also specialize in a niche that company leaders and strategists are not aware of. The tool uses data to show the competitiveness of each company in different sectors, such as consulting in water and wastewater or construction in civil engineering works.

The collected information includes project names and generic category information. To understand "Competitiveness by Category", the original generic category needs to be broken down further into detailed categories. Over 10k projects have been collected, so it is not practical to assign a detailed category to each project by hand. Furthermore, as more projects are collected, manually tagging categories becomes increasingly time-consuming.

Therefore, Natural Language Processing (NLP) is used to categorize both the current projects and any future projects.
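As a rough illustration of the idea, the sketch below trains a small text classifier that maps a project name to a category. It assumes scikit-learn with TF-IDF features and logistic regression, which is one common baseline and not necessarily the approach used in this project. The training names are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical project names and labels, for illustration only.
train_names = [
    "Watermain Replacement On Main Street",
    "Sanitary Sewer And Watermain Construction",
    "Road Resurfacing And Sidewalk Repairs",
    "Traffic Study For Regional Road Widening",
    "Transportation Master Plan Consulting Services",
    "Supply And Delivery Of Road Salt",
    "Supply And Delivery Of Office Furniture",
]
train_labels = [
    "Construction - Civil",
    "Construction - Civil",
    "Construction - Civil",
    "Consulting - Transportation",
    "Consulting - Transportation",
    "Supplier",
    "Supplier",
]

# TF-IDF turns each name into a weighted bag of words; logistic regression
# then learns which words point to which category.
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(train_names, train_labels)

# Predict a category for a new, unseen project name.
new_names = ["Supply And Delivery Of Winter Sand"]
predicted = model.predict(new_names)
print(predicted[0])
```

With only a handful of training rows the prediction is not reliable; the point is the shape of the workflow: labeled names in, a fitted model out, and `predict` applied to new project names.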

## 0.1 Import Libraries

In [59]:
import pandas as pd
import numpy as np

## 0.2 Import Original Dataset

In [62]:
path = "../NaturalLanguageProcessing/Data.csv"
df = pd.read_csv(path)

## 1. Data Wrangling

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48207 entries, 0 to 48206
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   client_name         48207 non-null  object 
 1   project_name        48207 non-null  object 
 2   Category            593 non-null    object 
 3   bid_classification  38805 non-null  object 
 4   bid_type            48114 non-null  object 
 5   bid_ID              48207 non-null  object 
 6   awarded_date        48202 non-null  object 
 7   awarded_year        48202 non-null  float64
 8   company_name        48016 non-null  object 
 9   submitted_price     30430 non-null  object 
 10  winning_status      48016 non-null  float64
dtypes: float64(2), object(9)
memory usage: 4.0+ MB


In [64]:
df.head()

| | client_name | project_name | Category | bid_classification | bid_type | bid_ID | awarded_date | awarded_year | company_name | submitted_price | winning_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | York Region | General Contractors And Subcontractors For Sen... | Construction - Restoration | Services | Pre-Qualification | PQ-20-156 | Mar 5 | 2021.0 | 560789 Ontario Limited o/a R&M Construction | NaN | 1.0 |
| 1 | York Region | General Contractors And Subcontractors For Sen... | Construction - Restoration | Services | Pre-Qualification | PQ-20-156 | Mar 5 | 2021.0 | Brinkman and Associates Reforestation Limited | NaN | 0.0 |
| 2 | York Region | General Contractors And Subcontractors For Sen... | Construction - Restoration | Services | Pre-Qualification | PQ-20-156 | Mar 5 | 2021.0 | Cambridge Landscaping & Construction Ltd | NaN | 1.0 |
| 3 | York Region | General Contractors And Subcontractors For Sen... | Construction - Restoration | Services | Pre-Qualification | PQ-20-156 | Mar 5 | 2021.0 | CSL Group Ltd | NaN | 0.0 |
| 4 | York Region | General Contractors And Subcontractors For Sen... | Construction - Restoration | Services | Pre-Qualification | PQ-20-156 | Mar 5 | 2021.0 | Dynex Construction Inc. | NaN | 1.0 |


## 1.1 Data Explanation

### 1.1.1 Data Explanation - client_name

In [42]:
client_list = df.client_name.unique().tolist()
len(client_list)

19

In [43]:
client_list

['York Region',
 'Waterloo Region',
 'Simcoe County',
 'Peterborough County',
 'Niagara Region',
 'Halton Region',
 'Haldimand County',
 'Durham Region',
 'Dufferin County',
 'City Of Peterborough',
 'City of Orillia',
 'City of London',
 'City of Kawartha Lakes',
 'City of Hamilton',
 'City of Guelph',
 'City of Brantford',
 'City of Barrie',
 'Brant County',
 'Peel Region']

There are 19 clients in the dataset. The client name will not be used as an input for NLP because it has no effect on the project name or the project category.

### 1.1.2 Data Explanation - Category

In [44]:
df.Category.describe()

count                      593
unique                      11
top       Construction - Civil
freq                       155
Name: Category, dtype: object

In [45]:
df.Category.unique().tolist()

['Construction - Restoration',
 'Supplier',
 'Consulting - Water Wastewater Linear',
 'Consulting - Transportation',
 'Construction - Health',
 'Construction - Transportation',
 'Construction - Facility',
 'Construction - Civil',
 'Consulting - Water Wastewater Vertical',
 'Construction - Water Wastewater',
 'Consulting - Others',
 nan]

The category is our target. All the existing values were added manually by me. We will remove them and assign categories later.
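One way to do this, sketched below on a tiny stand-in frame, is to set the hand-labeled rows aside before clearing the column, so they remain available for training or spot checks. The sample rows and the names `labeled` and `manual_labels` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df; only the columns needed here.
df = pd.DataFrame({
    "project_name": ["Road Repairs", "Salt Supply", "Bridge Design", "Sewer Lining"],
    "Category": ["Construction - Civil", "Supplier", np.nan, np.nan],
})

# Keep a copy of the manually labeled rows before touching the column.
labeled = df[df["Category"].notna()].copy()
manual_labels = labeled["Category"]

# Clear the target column so every row starts unlabeled.
df["Category"] = np.nan

print(len(labeled), "labeled rows kept aside")
print(df["Category"].isna().all())
```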

### 1.1.3 Data Explanation - bid_classification

In [46]:
df.bid_classification.unique()

array(['Services', 'Construction', 'Goods', nan], dtype=object)

In [50]:
df.bid_classification.value_counts(dropna = False)

Services        16952
Construction    16802
NaN              9402
Goods            5051
Name: bid_classification, dtype: int64
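To put the missing values in proportion, `value_counts` also accepts `normalize=True`. The sketch below rebuilds a series with the same counts as the output above, purely for illustration; on the real data the same calls would be made on `df.bid_classification` directly.

```python
import numpy as np
import pandas as pd

# Rebuild a series with the same class counts as the output above.
counts = {"Services": 16952, "Construction": 16802, np.nan: 9402, "Goods": 5051}
values = [label for label, n in counts.items() for _ in range(n)]
bid_classification = pd.Series(values, dtype="object", name="bid_classification")

# normalize=True returns fractions instead of counts; dropna=False keeps NaN.
shares = bid_classification.value_counts(normalize=True, dropna=False)
print(shares.round(3))

# Fraction of rows with no classification at all.
missing_share = bid_classification.isna().mean()
print(f"{missing_share:.1%} of rows have no bid_classification")
```

Roughly one row in five has no classification, so `bid_classification` alone cannot be relied on as a feature for every project.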

In [55]:
df[df.bid_classification.isnull()].head()

| | client_name | project_name | Category | bid_classification | bid_type | bid_ID | awarded_date | awarded_year | company_name | submitted_price | winning_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6433 | York Region | Emergency Repairs To Roads And Related Road Fa... | NaN | NaN | Tender | T-15-94 | Mar 15 | 2016.0 | K.J. Beamish Construction Co., Limited | $2,979,666.00 | 0.0 |
| 6434 | York Region | Emergency Repairs To Roads And Related Road Fa... | NaN | NaN | Tender | T-15-94 | Mar 15 | 2016.0 | 614128 Ontario Ltd o/a Trisan Construction | $5,532,704.00 | 0.0 |
| 6571 | York Region | Construction Of The Midblock Collector Road An... | NaN | NaN | Tender | T-14-53 | Apr 12 | 2016.0 | Aecon Construction and Materials Limited | $41,375,923.41 | 0.0 |
| 6572 | York Region | Construction Of The Midblock Collector Road An... | NaN | NaN | Tender | T-14-53 | Apr 12 | 2016.0 | Brennan Paving & Construction Ltd. | $35,811,814.16 | 1.0 |
| 6573 | York Region | Construction Of The Midblock Collector Road An... | NaN | NaN | Tender | T-14-53 | Apr 12 | 2016.0 | Coco Paving Inc. | $41,589,331.45 | 0.0 |
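The rows above also show that `submitted_price` is stored as text such as "$2,979,666.00", which matches the `object` dtype reported by `df.info()`. If prices are needed as numbers, a conversion along these lines may work. It assumes every non-missing price follows the `$1,234.56` pattern, which has not been verified against the full dataset.

```python
import numpy as np
import pandas as pd

# Sample prices copied from the rows above, plus one missing value.
prices = pd.Series(
    ["$2,979,666.00", "$5,532,704.00", "$41,375,923.41", np.nan],
    name="submitted_price",
)

# Strip the dollar sign and thousands separators, then convert to float.
# errors="coerce" turns anything unparseable into NaN instead of raising.
cleaned = prices.str.replace(r"[$,]", "", regex=True)
price_numeric = pd.to_numeric(cleaned, errors="coerce")

print(price_numeric)
```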


## 3.0 Data Observation

### 3.1 Obtain data
Use pandas to read the data I scraped.

In [28]:
excel_path = '/Users/delinmu/Documents/GitHub/BidTing/result/data.xlsx'
bidding_info = pd.read_excel(excel_path, sheet_name='Ori')

In [None]:
# a quick view of bidding_info dataframe
bidding_info.head()

### 3.2 Dataframe Observation
A few questions I want to answer here:
1. How many unique clients are in the dataframe?
2. How many unique projects are in the dataframe?
3. Can I build a table with four columns (client_name, bid_id, project_name, and awarded_year) that shows how many projects were awarded over the time span?
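A minimal sketch of how these three questions could be answered, run on a small made-up frame. It assumes a project is identified by the pair (`client_name`, `bid_ID`); the column name `bid_ID` follows the `df.info()` listing earlier and may be spelled differently in the Excel sheet.

```python
import pandas as pd

# Made-up rows: one row per bidder, so projects repeat.
bidding_info = pd.DataFrame({
    "client_name": ["York Region", "York Region", "York Region",
                    "Peel Region", "Peel Region"],
    "bid_ID": ["T-15-94", "T-15-94", "T-14-53", "P-01", "P-01"],
    "project_name": ["Road Repairs", "Road Repairs", "Collector Road",
                     "Pump Station", "Pump Station"],
    "awarded_year": [2016, 2016, 2016, 2017, 2017],
})

# 1. Number of unique clients.
n_clients = bidding_info["client_name"].nunique()

# 2 and 3. One row per project: drop the repeated bidder rows.
project_cols = ["client_name", "bid_ID", "project_name", "awarded_year"]
projects = bidding_info[project_cols].drop_duplicates(
    subset=["client_name", "bid_ID"]
)
n_projects = len(projects)

# Projects per client per year, from the de-duplicated table.
per_year = (
    projects.groupby(["client_name", "awarded_year"])
    .size()
    .unstack(fill_value=0)
)

print(n_clients, n_projects)
print(per_year)
```

De-duplicating before counting matters because the raw frame holds one row per bidder, so counting rows would overstate the number of projects.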

### 3.2.1 Unique Clients
The value_counts() method lists each client with its number of rows, and the length of that result is the number of unique clients. Alternatively, nunique() returns the number of unique clients directly.

In [39]:
# number of rows per client in my dataframe
bidding_info.client_name.value_counts()

York Region               9726
Peel Region               5519
City of Guelph            5227
Waterloo Region           5067
City of Hamilton          3857
City of Brantford         3111
Halton Region             2521
City of London            2453
Durham Region             2401
Niagara Region            1777
City of Barrie            1463
Simcoe County             1295
Brant County              1004
City of Kawartha Lakes     849
City of Orillia            493
Peterborough County        439
City Of Peterborough       429
Dufferin County            330
Haldimand County           246
Name: client_name, dtype: int64

Note that the value count for each client is the number of rows (one per bidder per project), not the number of unique projects.
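The gap between the two can be seen by putting row counts next to unique project counts. The sketch below uses made-up rows and assumes `bid_ID` identifies a project within a client.

```python
import pandas as pd

# Made-up rows: three bidders on one project, two on another, one on a third.
bidding_info = pd.DataFrame({
    "client_name": ["York Region"] * 5 + ["Peel Region"],
    "bid_ID": ["A-1", "A-1", "A-1", "A-2", "A-2", "B-1"],
})

summary = bidding_info.groupby("client_name").agg(
    rows=("bid_ID", "size"),                # what value_counts() reports
    unique_projects=("bid_ID", "nunique"),  # distinct bids per client
)
print(summary)
```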

In [40]:
len(bidding_info.client_name.value_counts())

19

In total, the dataframe contains 19 clients. This is useful for knowing how many geographical segments we