In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np

## Preparation

### Package loading

(To do once per environment)

Clone the package from https://github.com/ClemenceK/bpifrance_deeptech_analysis
(You probably have already done it if you have this notebook!)

In a terminal, move to this project's folder (one level above this notebook, where `setup.py` is located)
and install the package from there:

`pip install .`

Clone the deep4deep package in the same place as the first package, from: https://github.com/ClemenceK/deep4deep

In the terminal, stay in the same folder as before and install the deep4deep package:
`pip install ../../deep4deep/`

In [3]:
#this can be used to test if the import worked
from deep4deep.utils import simple_time_tracker

@simple_time_tracker
def test():
    print("It works if below is printed something like: test 0.0, or another figure")
    
test()

It works if below is printed something like: test 0.0, or another figure
test 0.0


### Local files preparation

In the bpideep folder, create a `.env` file containing the following line (replace `your_key` with your own key):

DEALROOMAPIKEY= 'your_key'

(This keeps your key off GitHub: `.env` is listed in `.gitignore`, so it is not uploaded.)
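
As a minimal sketch of how such a file can be read, assuming the simple `KEY= 'value'` layout described above: the helper name `read_env_file` is hypothetical and is not part of bpideep, which may well rely on a dedicated library instead.

```python
from pathlib import Path


def read_env_file(path):
    """Parse KEY=value lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around '=' and optional single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values
```

Calling `read_env_file(".env")["DEALROOMAPIKEY"]` would then return the key without its quotes, which is a quick way to check that the file is well formed before making any API call.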

## Building the base data from dealroom

### Get data from Dealroom with the functions written by the former Wagon team   

- **By name** : use the function `company_search` on each individual company:

```
from bpideep.getdata import company_search
company_search("verkor")
```

⚠️⚠️⚠️ 
**Important note: if you have a limited number of Dealroom API calls (or if you are billed according to the number of calls), the search for new companies should be performed in batches.**

- If you want to use the **Dealroom ID**, you will need to adapt the function `getdata.getfulldata()` (written by the former Wagon team). It works in batches, so you can also use it as an example if you want to query names in batches.
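
A possible way to organize a search by batch is sketched below. The helper `chunked` is hypothetical (it is not part of bpideep), and the sketch only splits the list of names; each batch would still have to be passed to `company_search` (or an adapted `getfulldata`) and saved before moving on to the next one.

```python
def chunked(names, batch_size):
    """Yield successive lists of at most batch_size names."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]


companies = ["verkor", "angell", "carbios", "mastergrid", "pasqal"]
batches = list(chunked(companies, 2))
```

Saving the result of each batch to disk before requesting the next one means that an exhausted quota only costs the current batch, not the whole run.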

In [4]:
from bpideep.getdata import company_search

**Code to get Dealroom data for our 9 companies:**

new = ["verkor", "angell", "carbios","mastergrid","pasqal","gourmey", "Epigene Labs","SpaceSense","Kraaft"]
df = pd.DataFrame()

for company in new :
    
    tmp = company_search(company)
    df = pd.concat([df, tmp], ignore_index=True)

df.to_csv("../bpideep/rawdata/demo_data.csv", index=False)
df.head(2)

Note: the code above is kept in a text cell to minimize API calls; convert the cell to code to run it. You can also try it on just 1 or 2 companies to limit API calls.

If you are just looking, copy demo_data.csv from Google Drive (folder "to copy in raw_data") to your bpideep/rawdata folder (create it if needed; we don't synchronize data on GitHub because it can be heavy), then use the code in the cell below to load it.
Google Drive: https://drive.google.com/drive/folders/1PJYZ9hHrgyLLVS8mhweoytEKgQRAeLT6

In [5]:
# code to load demo_data.csv if you are just looking for a demo (after copying it from Google Drive)
# run it anyway if you used the code in the previous cell to save data (you can change the file name in both cells)
# as the 'load_json_field' function below is made to work on a loaded csv
df = pd.read_csv("../bpideep/rawdata/demo_data.csv")

### Select the needed columns to make the data analysis easier

In [6]:
new_data = df[["id", "name", "total_funding_source", "employees",
               "employees_latest", "launch_year", "growth_stage", 
               "linkedin_url", "industries", "investors", "team", "website_url"]].copy()
new_data.head(2)
# 'team' is kept in order to get PhDs for part 3.4, where we create one feature with all PhDs from Dealroom and LinkedIn

Unnamed: 0,id,name,total_funding_source,employees,employees_latest,launch_year,growth_stage,linkedin_url,industries,investors,team,website_url
0,1985985,Verkor,0,2-10,9.0,2020,seed,https://www.linkedin.com/company/verkor/,"[{'id': 100023, 'name': 'energy'}]","{'items': [{'id': 869605, 'name': 'EIT InnoEne...","{'items': [{'id': 2002501, 'name': 'Benoit L.'...",http://verkor.com/
1,1841152,Angell,10000000,11-50,25.0,2018,early growth,https://www.linkedin.com/company/angell,"[{'id': 100111, 'name': 'transportation'}]","{'items': [{'id': 1476722, 'name': 'Groupe SEB...","{'items': [{'id': 57584, 'name': 'Marc Simonci...",https://angell.bike/


### Add nb_patents to the new_data. 

Note: In our example, as we don't have an extract from Google Patents Search, we will only create a column "nb_patents" with 0 in it.

In [7]:
# placeholder value: a scalar is broadcast to every row
new_data["nb_patents"] = 0

### Create a new column "age" to get the age of the company thanks to the column "launch_year"

In [8]:
from datetime import datetime
current_year = datetime.today().year
current_year

2020

In [9]:
new_data["age"] = current_year - new_data.launch_year

### Create a new feature "investors_type"

In [10]:
# loading a few functions to help create investors_type column
from bpideep.GetCleanData import load_json_field, get_health, investors_type, simple_fund_investors

In [11]:
# from string back to json for fields that have been "stringified" by saving to csv
new_data["investors"] = new_data["investors"].apply(load_json_field)

#extracting the types of investors from the json
new_data["investors_type"] = new_data["investors"].map(investors_type)

#encoding as 0 or 1
new_data.loc[:,'investors_type'] = new_data['investors_type'].map(simple_fund_investors)

new_data.head(2)

Unnamed: 0,id,name,total_funding_source,employees,employees_latest,launch_year,growth_stage,linkedin_url,industries,investors,team,website_url,nb_patents,age,investors_type
0,1985985,Verkor,0,2-10,9.0,2020,seed,https://www.linkedin.com/company/verkor/,"[{'id': 100023, 'name': 'energy'}]","{'items': [{'id': 869605, 'name': 'EIT InnoEne...","{'items': [{'id': 2002501, 'name': 'Benoit L.'...",http://verkor.com/,0,0,1
1,1841152,Angell,10000000,11-50,25.0,2018,early growth,https://www.linkedin.com/company/angell,"[{'id': 100111, 'name': 'transportation'}]","{'items': [{'id': 1476722, 'name': 'Groupe SEB...","{'items': [{'id': 57584, 'name': 'Marc Simonci...",https://angell.bike/,0,2,0


Note: if you get errors, it may be because you ran the same cell several times, so the columns no longer contain what is expected. In that case, rerun from section 2.1.
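
The "from string back to json" step can be pictured with the sketch below. It is not the implementation of `load_json_field`; the helper name `parse_stringified_field` is hypothetical, and the sketch assumes the stringified fields use Python-style single quotes, as in the outputs shown in this notebook (which is why `ast.literal_eval` is used rather than a JSON parser).

```python
import ast


def parse_stringified_field(value):
    """Turn a CSV-stringified list/dict back into a Python object (hypothetical helper)."""
    if not isinstance(value, str):
        # Already parsed (e.g. the cell was run twice) or missing (NaN): return unchanged.
        return value
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: leave the raw string for manual inspection.
        return value


industries = parse_stringified_field("[{'id': 100023, 'name': 'energy'}]")
```

Because non-string values are returned unchanged, applying this helper twice to the same column is harmless, which avoids one of the "cell run multiple times" errors mentioned in the note.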

### Create a new feature "health_industry"

In [12]:
new_data["health_industry"] = get_health(new_data["industries"])
new_data.head(2)

Unnamed: 0,id,name,total_funding_source,employees,employees_latest,launch_year,growth_stage,linkedin_url,industries,investors,team,website_url,nb_patents,age,investors_type,health_industry
0,1985985,Verkor,0,2-10,9.0,2020,seed,https://www.linkedin.com/company/verkor/,"[{'id': 100023, 'name': 'energy'}]","{'items': [{'id': 869605, 'name': 'EIT InnoEne...","{'items': [{'id': 2002501, 'name': 'Benoit L.'...",http://verkor.com/,0,0,1,0
1,1841152,Angell,10000000,11-50,25.0,2018,early growth,https://www.linkedin.com/company/angell,"[{'id': 100111, 'name': 'transportation'}]","{'items': [{'id': 1476722, 'name': 'Groupe SEB...","{'items': [{'id': 57584, 'name': 'Marc Simonci...",https://angell.bike/,0,2,0,0
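
To give an idea of what a 0/1 industry flag of this kind involves, here is a hedged sketch. It is not the code of `get_health`: the function name `has_health_industry` and the industry label `'health'` are assumptions, and the sketch expects the field to be already parsed into a list of dicts.

```python
def has_health_industry(industries):
    """Return 1 if any industry entry is named 'health', else 0 (hypothetical rule)."""
    if not isinstance(industries, list):
        # Missing or unparsed field: treated as "not health" in this sketch.
        return 0
    names = {
        str(item.get("name", "")).lower()
        for item in industries
        if isinstance(item, dict)
    }
    return int("health" in names)


flag = has_health_industry([{"id": 100023, "name": "energy"}])
```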


## Scraping and adding the LinkedIn data

### Generate COMPANIES scripts for webscraper and scrape LinkedIn
For a better understanding of how to scrape files from LinkedIn, use the Florent_demo notebook (in the notebooks folder and/or the documentation folder); request a demo from Florent Martin if needed!

The scraped data should be placed in the folder `bpi_deep/scraping_data/companies_people/`
before calling the function `build_employee_df`.

In [13]:
from bpideep.scraping_scripting import make_script_company_scraping
# here we use a batch size of 10, but you can use a larger one, up to 100
make_script_company_scraping(new_data,10)

script_batch_0
{"_id":"scraping","startUrl":["https://www.linkedin.com/company/verkor//people", "https://www.linkedin.com/company/angell/people", "https://www.linkedin.com/company/carbios/people", "https://www.linkedin.com/company/mastergrid//people", "https://www.linkedin.com/company/pasqal/people", "https://www.linkedin.com/company/gourmey/people", "https://www.linkedin.com/company/epigene-labs/people", "https://www.linkedin.com/company/spacesense-co/people", "https://www.linkedin.com/company/kraaft-co/people"],"selectors":[                    {"id":"container","type":"SelectorElementScroll","parentSelectors":["_root"],"selector":"div.org-people-profile-card__profile-info","multiple":true,"delay":"1234"},                        {"id":"name","type":"SelectorText","parentSelectors":["container"],"selector":"div.org-people-profile-card__profile-title","multiple":false,"regex":"","delay":0},                        {"id":"title","type":"SelectorText","parentSelectors":["container"],"selec

These scripts will be needed to use web scraper:
https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn
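
In the script output above, each start URL is a company page followed by `/people`, and a company URL that already ends with a slash yields a doubled slash (`verkor//people`). The sketch below shows one way to build such URLs without the doubled slash and to split them into batches. It is only an illustration: `people_url` and `start_url_batches` are hypothetical names, not the implementation of `make_script_company_scraping`, and the doubled slash may well be harmless in practice.

```python
import json


def people_url(company_url):
    """Build the LinkedIn 'people' page URL without doubling the slash."""
    return company_url.rstrip("/") + "/people"


def start_url_batches(company_urls, batch_size):
    """Split people-page URLs into batches, one batch per scraping script."""
    urls = [people_url(u) for u in company_urls]
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


batches = start_url_batches(
    ["https://www.linkedin.com/company/verkor/", "https://www.linkedin.com/company/angell"],
    batch_size=10,
)
# Only the "_id" and "startUrl" keys are taken from the output above; selectors are omitted.
sitemap_fragment = json.dumps({"_id": "scraping", "startUrl": batches[0]})
```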

**How to use webscraper to generate csv**
+ Install the extension
+ On Chrome: click the three vertical dots at the top right > More tools > Developer tools
+ Then (or after right-click > Inspect on any page) you should see "Webscraper" as one of the tools
    + It is recommended to use "Dock to bottom" configuration in "Dock side" parameter for a better view
    
    
+ Click Webscraper > **Create New Site Map > Import Site Map**
+ Paste the description from the script obtained above in "Sitemap JSON"
    + Each script starts with { and ends with }
    + The different scripts are delimited by "script_batch_0", "script_batch_1"… you can only do one at a time
+ Give the script name (e.g. script_batch_0) in "Rename Sitemap" (it will be the name of the csv file you'll obtain)
+ Click "Import sitemap"
+ Click the "Sitemap (your chosen name)" menu
+ Click "Scrape"
+ Click "Start Scraping"
    + A browser window opens and loads the pages to scrape
    + you can keep working on other things meanwhile
    + You may occasionally be signed out of LinkedIn: just sign in again and reload the scraping
    + You can hit the "refresh" button on the initial page to see already scraped data
    + You know it is finished when the new browser window closes
+ Once finished, 
    + click the "Sitemap (your chosen name)" menu again then "Export data as CSV"
    + click "Download now"
    + Choose the folder `bpi_deep/scraping_data/companies_people/` (create it if needed, as it is not uploaded on GitHub)
+ Then repeat from "Create New Site Map > Import Site Map" for the next script until all scripts have been covered
    + When a script has "startUrl":[ ] and no pop-up window opens, it means that all requested companies have been covered

### Process the scraped data to a dataframe

Create a `result_files` directory inside the `bpi_deep/scraping_data/` folder.
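
The folders used in this notebook can also be created from Python, as in the sketch below. The sub-folder names are the ones mentioned in this notebook; the function name `ensure_scraping_folders` and the `base_dir` argument are assumptions, since the base folder depends on where the package is cloned.

```python
from pathlib import Path


def ensure_scraping_folders(base_dir):
    """Create the scraping sub-folders named in this notebook if they are missing."""
    subfolders = ["companies_people", "founders_files", "result_files", "scraping_scripts"]
    created = []
    for name in subfolders:
        folder = Path(base_dir) / "scraping_data" / name
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(folder)
    return created
```

Because `exist_ok=True` is used, the function can be run again at any time without affecting files that are already there.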

In [14]:
from bpideep.process_scraped_data import build_employee_df, process_employee_data

In [15]:
# The csv containing the scraped data should be placed in the folder `bpi_deep/scraping_data/companies_people/`
# before calling the function "build_employee_df"
df_employees = process_employee_data(build_employee_df())

In [16]:
#Example of the content of the employee dataframe after processing
df_employees.head()

Unnamed: 0,employee_name,title,profile-href,linkedin_url,technical,founder,phd
0,Agnès Mathé,responsable communication,https://www.linkedin.com/in/agn%C3%A8s-math%C3...,https://www.linkedin.com/company/carbios,0,0,0
1,Loic Zangara,vice-president france & operations,https://www.linkedin.com/in/loic-zangara-b8190...,https://www.linkedin.com/company/mastergrid,0,0,0
2,Gilles Stedile,superviseur chantier,https://www.linkedin.com/in/gilles-stedile-28b...,https://www.linkedin.com/company/mastergrid,0,0,0
3,,directeur technique,,https://www.linkedin.com/company/mastergrid,1,0,0
4,Meryl Merloz,purchaser,https://www.linkedin.com/in/merylmerloz/,https://www.linkedin.com/company/mastergrid,0,0,0


### Generate EMPLOYEES profile scripts for webscraper and scrape LinkedIn
Also documented in the Florent_demo notebook (in the notebooks folder and/or the documentation folder)

In [17]:
from bpideep.scraping_scripting import make_script_employee_scraping

Create the folder `bpideep/scraping_data/scraping_scripts`

In [18]:
make_script_employee_scraping(df_employees, 100, founders = True)

script_batch_0
{"_id":"profiles","startUrl":["https://www.linkedin.com/in/antoine-davydoff-35a569149/", "https://www.linkedin.com/in/alain-marty-40251539/", "https://www.linkedin.com/in/christophe-mille-506729/", "https://www.linkedin.com/in/nicolasmorinforest/", "https://www.linkedin.com/in/pauline-de-breteuil/", "https://www.linkedin.com/in/sylvainpaineau/", "https://www.linkedin.com/in/dekelpersi/", "https://www.linkedin.com/in/philippechain/", "https://www.linkedin.com/in/victor-sayous-a70190106/", "https://www.linkedin.com/in/matthieu-marquenet/", "https://www.linkedin.com/in/marc-negre-9548a58b/", "https://www.linkedin.com/in/fran%C3%A7ois-dechelette-357b481a/", "https://www.linkedin.com/in/eliott-raoult/", "https://www.linkedin.com/in/akpelinordor/", "https://www.linkedin.com/in/benoit-l-89772a2/", "https://www.linkedin.com/in/barriere/", "https://www.linkedin.com/in/christophe-jurczak/", "https://www.linkedin.com/in/sami-yacoubi-05902992/", "https://www.linkedin.com/in/martin-j

[["https://www.linkedin.com/in/antoine-davydoff-35a569149/",
  "https://www.linkedin.com/in/alain-marty-40251539/",
  "https://www.linkedin.com/in/christophe-mille-506729/",
  "https://www.linkedin.com/in/nicolasmorinforest/",
  "https://www.linkedin.com/in/pauline-de-breteuil/",
  "https://www.linkedin.com/in/sylvainpaineau/",
  "https://www.linkedin.com/in/dekelpersi/",
  "https://www.linkedin.com/in/philippechain/",
  "https://www.linkedin.com/in/victor-sayous-a70190106/",
  "https://www.linkedin.com/in/matthieu-marquenet/",
  "https://www.linkedin.com/in/marc-negre-9548a58b/",
  "https://www.linkedin.com/in/fran%C3%A7ois-dechelette-357b481a/",
  "https://www.linkedin.com/in/eliott-raoult/",
  "https://www.linkedin.com/in/akpelinordor/",
  "https://www.linkedin.com/in/benoit-l-89772a2/",
  "https://www.linkedin.com/in/barriere/",
  "https://www.linkedin.com/in/christophe-jurczak/",
  "https://www.linkedin.com/in/sami-yacoubi-05902992/",
  "https://www.linkedin.com/in/martin-j-stepha

The scripts are saved in the folder `bpideep/scraping_data/scraping_scripts`; you can open them with Sublime Text or another text editor.

Using the same process as before with Webscraper, scrape the employee profiles using the generated scripts.
The only difference: this time, save the files in the folder `bpi_deep/scraping_data/founders_files/`.

The scraped data must be in that folder before calling the function `open_founder_profile_files`.

### Process the scraped data to a dataframe

In [19]:
# Prior to calling the function "open_founder_profile_files", the csv containing the scraped data from founders
# should be placed in the folder 'bpi_deep/scraping_data/founders_files/'
from bpideep.process_scraped_data import open_founder_profile_files, inline_profile
from bpideep.process_scraped_data import build_founders_dataframe, generate_founders_features
df_founders_raw = open_founder_profile_files()

In [20]:
# The function "build_founders_dataframe" processes the raw df and returns a df with one line per founder
# The function "generate_founders_features" generates the new relevant features such as "founder_has_phd", etc.
df_founders = generate_founders_features(build_founders_dataframe(df_founders_raw))

# It generates many warnings but none is in our own files 
# (all in pandas/core/frame.py or indexing.py) so please ignore them

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)

In [21]:
df_founders.head(3)

Unnamed: 0,profile-href,title,company,exp_description,title_2,company_2,exp_description_2,title_3,company_3,exp_description_3,...,type_4,amount_4,text_content_4,type_5,amount_5,text_content_5,founder_has_phd,founder_from_institute,founder_pat_pub,technical_founder
0,https://www.linkedin.com/in/sami-yacoubi-05902...,Engineering Intern,VINCI Construction,-Developement of a hybrid mobile app that allo...,Travel semester in New-Zealand,New-Zealand,Semester spent away from university in New-Zea...,Study semester in Austria,Graz University of Technology,"Erasmus exchange program in Graz, Austria to i...",...,,,,,,,0,0,0,0
1,https://www.linkedin.com/in/sylvainpaineau/,Board Member,Cocoon Care,,Administrateur,Tenerrdis,,Co-Founder & Chief Strategy & Partnerships Off...,Verkor\n Permanent,,...,,,,,,,0,0,0,0
2,https://www.linkedin.com/in/victor-sayous-a701...,Internship,Université de Bordeaux,GESVAB-Institut des Sciences de la Vigne et du...,Internship,CNRS - Centre national de la recherche scienti...,UMR 7245 CNRS/MNHN Département Régulation-Déve...,Co-Founder & CTO,GOURMEY,On a mission to bring delicious cultivated mea...,...,,,,,,,1,1,0,1


In [22]:
# Finally, we merge founders to the full employee DF, update the feature "technical", and aggregate into companies
from bpideep.process_scraped_data import companies_technical_stats_with_founders_features, update_technical
df_employees_full = update_technical(df_employees, df_founders)
df_companies_stats_with_founders_features = companies_technical_stats_with_founders_features(df_employees_full)
df_companies_stats_with_founders_features

Unnamed: 0,linkedin_url,technical,phd_linkedin,employee__linkedin_count,founder_from_institute,founder_has_phd,founder_pat_pub,technical_founder,no_linkedin_data
0,https://www.linkedin.com/company/carbios,0.315789,1.0,38,0.0,1.0,0.0,1.0,0
1,https://www.linkedin.com/company/carester,0.571429,1.0,7,0.0,0.0,0.0,0.0,0
2,https://www.linkedin.com/company/epigene-labs,0.25,1.0,12,0.0,1.0,0.0,1.0,0
3,https://www.linkedin.com/company/gourmey,0.526316,5.0,19,2.0,1.0,0.0,2.0,0
4,https://www.linkedin.com/company/kraaft-co,0.142857,0.0,14,0.0,0.0,0.0,0.0,0
5,https://www.linkedin.com/company/mastergrid,0.183908,0.0,87,0.0,0.0,0.0,0.0,0
6,https://www.linkedin.com/company/pasqal,0.5,4.0,16,0.0,2.0,2.0,2.0,0
7,https://www.linkedin.com/company/spacesense-ai,0.307692,1.0,13,0.0,0.0,0.0,0.0,0
8,https://www.linkedin.com/company/verkor,0.333333,0.0,12,1.0,0.0,1.0,1.0,0


----
Please **delete the two cells below** when running the notebook for yourself. They only correct URL mismatches that arose because we did the scraping with URLs not coming from Dealroom, and scraping again would take a long time; if you follow this notebook, you won't have the same problem.

In [23]:
# Here we must correct a LinkedIn URL mismatch (scraping was done with the url 
# https://www.linkedin.com/company/mastergrid whereas the Dealroom url has a slash at the end).
# Note that there would be no issue when running the process in the correct order,
# i.e. first doing the Dealroom query and then building the scraping script
old_url = 'https://www.linkedin.com/company/mastergrid'
new_url = 'https://www.linkedin.com/company/mastergrid/'
df_companies_stats_with_founders_features[df_companies_stats_with_founders_features['linkedin_url'] == old_url]
df_companies_stats_with_founders_features.loc\
    [df_companies_stats_with_founders_features['linkedin_url'] == old_url, "linkedin_url"]= new_url

In [24]:
# Similar issue for spacesense
old_url = 'https://www.linkedin.com/company/spacesense-ai'
new_url = 'https://www.linkedin.com/company/spacesense-co'
df_companies_stats_with_founders_features[df_companies_stats_with_founders_features['linkedin_url'] == old_url]
df_companies_stats_with_founders_features.loc\
    [df_companies_stats_with_founders_features['linkedin_url'] == old_url, "linkedin_url"]= new_url

---

In [25]:
df_companies_stats_with_founders_features

Unnamed: 0,linkedin_url,technical,phd_linkedin,employee__linkedin_count,founder_from_institute,founder_has_phd,founder_pat_pub,technical_founder,no_linkedin_data
0,https://www.linkedin.com/company/carbios,0.315789,1.0,38,0.0,1.0,0.0,1.0,0
1,https://www.linkedin.com/company/carester,0.571429,1.0,7,0.0,0.0,0.0,0.0,0
2,https://www.linkedin.com/company/epigene-labs,0.25,1.0,12,0.0,1.0,0.0,1.0,0
3,https://www.linkedin.com/company/gourmey,0.526316,5.0,19,2.0,1.0,0.0,2.0,0
4,https://www.linkedin.com/company/kraaft-co,0.142857,0.0,14,0.0,0.0,0.0,0.0,0
5,https://www.linkedin.com/company/mastergrid/,0.183908,0.0,87,0.0,0.0,0.0,0.0,0
6,https://www.linkedin.com/company/pasqal,0.5,4.0,16,0.0,2.0,2.0,2.0,0
7,https://www.linkedin.com/company/spacesense-co,0.307692,1.0,13,0.0,0.0,0.0,0.0,0
8,https://www.linkedin.com/company/verkor,0.333333,0.0,12,1.0,0.0,1.0,1.0,0


### Last step: merge the LinkedIn scraping results with new_data
The merge is done on the LinkedIn URLs
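
Since the merge key is a URL, two spellings of the same page (with and without a trailing slash, as in the mastergrid case) will not match. The sketch below shows one way to normalize URLs before merging; `normalize_linkedin_url` is a hypothetical helper, and nothing here indicates that `merge_initial_companies_with_founder` applies such a normalization itself.

```python
def normalize_linkedin_url(url):
    """Strip surrounding whitespace and trailing slashes so equal pages compare equal."""
    return url.strip().rstrip("/")


dealroom_url = "https://www.linkedin.com/company/mastergrid/"
scraped_url = "https://www.linkedin.com/company/mastergrid"
same_company = normalize_linkedin_url(dealroom_url) == normalize_linkedin_url(scraped_url)
```

With pandas, the helper could be applied to the `linkedin_url` column of both dataframes (for instance with `.map(normalize_linkedin_url)`) before the merge. It would not resolve a genuinely different company slug, such as `spacesense-ai` versus `spacesense-co`.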

In [26]:
from bpideep.process_scraped_data import merge_initial_companies_with_founder
final = merge_initial_companies_with_founder(new_data, df_companies_stats_with_founders_features)

In [27]:
final.drop(columns=['team'], inplace=True)
#Remove 'team' after it was used to update phds.

## Finalizing the dataframe

### To use the pipeline, the names and order of the columns passed must match those expected

In [28]:
expected_columns = ['id', 'name', 'total_funding_source', 'employees', 'employees_latest',
       'launch_year', 'growth_stage', 'linkedin_url', 'industries',
       'investors', 'launch_year_clean', 'growth_stage_imputed',
       'employees_clean', 'age', 'nb_patents', 'investors_type',
       'health_industry', 'company_has_phd', 'proportion_technical',
       'founder_from_institute', 'founder_has_phd', 'No_people_input']

In [29]:
final_for_NLP = final.copy()

In [30]:
final["launch_year_clean"] = final.launch_year # keeping both columns for consistency
final["growth_stage_imputed"] = final.growth_stage # keeping both columns for consistency
final['employees_clean'] = final.employees_latest # keeping both columns for consistency
# encoding the number of PhDs (taking into account data from both Dealroom and LinkedIn)
# as a yes/no feature
final['company_has_phd'] = (final['phd_total'] > 0).map(int)

final.rename(columns={"no_linkedin_data" : "No_people_input",
                        "technical" : "proportion_technical"
                        }, inplace=True)
final.drop(columns=['deal_room_phd', 'employee__linkedin_count', 'founder_pat_pub', 'phd_linkedin', 
                        'technical_founder', 'phd_total'], inplace=True)    
final = pd.DataFrame(final, columns=expected_columns)
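
To our understanding, rebuilding a dataframe with `columns=expected_columns` reorders the columns and fills any name that is absent with NaN instead of raising an error, so a misspelled or missing column can go unnoticed until the pipeline fails. A small check such as the following sketch (the helper name `missing_columns` and the shortened column lists are illustrative only) can be run on `final.columns` just before that last line.

```python
def missing_columns(actual_columns, expected_columns):
    """Return expected column names that are absent, preserving the expected order."""
    present = set(actual_columns)
    return [name for name in expected_columns if name not in present]


# Shortened, illustrative lists; in the notebook these would be final.columns and expected_columns.
expected = ["id", "name", "age", "nb_patents", "company_has_phd"]
actual = ["id", "name", "age", "nb_patents"]
absent = missing_columns(actual, expected)
```

An empty result means every expected column is present and the reordering is safe.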

### Checking the NaNs and filling manually if needed

In [31]:
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      9 non-null      int64  
 1   name                    9 non-null      object 
 2   total_funding_source    9 non-null      int64  
 3   employees               9 non-null      object 
 4   employees_latest        8 non-null      float64
 5   launch_year             9 non-null      int64  
 6   growth_stage            9 non-null      object 
 7   linkedin_url            9 non-null      object 
 8   industries              9 non-null      object 
 9   investors               9 non-null      object 
 10  launch_year_clean       9 non-null      int64  
 11  growth_stage_imputed    9 non-null      object 
 12  employees_clean         8 non-null      float64
 13  age                     9 non-null      int64  
 14  nb_patents              9 non-null      int64 

Real data is always preferable, even if approximated and filled in manually, but the following fields may be left empty or NaN before entering the pipeline:
 
+ number of employees: "employees_latest" and "employees_clean"
+ "age"
+ "nb_patents"

The pipeline will be able to approximate the missing data, using the average or most frequent data in the training set. 

⚠️ If you have empty cells in any other fields, the model will throw an error, so make sure to fill them.

In our example, all 9 firms have a growth_stage indicated in Dealroom, but **growth_stage and launch_year** are especially prone to be missing in Dealroom and must be checked.
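
The rule above can be turned into a quick check, sketched below. The set of imputed fields is taken from the list above; the function name `columns_needing_manual_fill` and the example counts are assumptions made for illustration.

```python
IMPUTED_BY_PIPELINE = {"employees_latest", "employees_clean", "age", "nb_patents"}


def columns_needing_manual_fill(null_counts):
    """List columns that contain missing values but are not imputed by the pipeline."""
    return sorted(
        column
        for column, count in null_counts.items()
        if count > 0 and column not in IMPUTED_BY_PIPELINE
    )


# With pandas, null_counts could come from: final.isna().sum().to_dict()
# The counts below are made up for the example.
example_counts = {"employees_latest": 1, "growth_stage": 0, "founder_has_phd": 2}
to_fill = columns_needing_manual_fill(example_counts)
```

Any column returned by this check has to be filled before calling the model.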

In [32]:
# fill the NaN values in the columns below, which are not imputed by the pipeline
final.fillna(value={'proportion_technical': 0, 'founder_from_institute': 0, 'founder_has_phd': 0}, inplace=True)
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      9 non-null      int64  
 1   name                    9 non-null      object 
 2   total_funding_source    9 non-null      int64  
 3   employees               9 non-null      object 
 4   employees_latest        8 non-null      float64
 5   launch_year             9 non-null      int64  
 6   growth_stage            9 non-null      object 
 7   linkedin_url            9 non-null      object 
 8   industries              9 non-null      object 
 9   investors               9 non-null      object 
 10  launch_year_clean       9 non-null      int64  
 11  growth_stage_imputed    9 non-null      object 
 12  employees_clean         8 non-null      float64
 13  age                     9 non-null      int64  
 14  nb_patents              9 non-null      int64 

## Using the main model to predict whether the company is deeptech

Use the model already trained on our entire dataset, excluding almost_deep_tech companies and some duplicates (1332 observations):

In [33]:
# The "final" dataframe should have 22 fields
final.shape[1]

22

In [34]:
# pip install dill if needed
# dill works like pickle but handles some objects better, e.g. lambda functions

import dill as pickle

# Load the pipeline from the pickle file (the context manager closes the file afterwards)
with open("../bpideep/bpideepmodelnew.pkl", "rb") as f:
    my_pipeline = pickle.load(f)

In [35]:
# Predict the class probabilities using the reloaded model
y_pred = pd.DataFrame(my_pipeline.predict_proba(final), columns=['0','1'])
y_pred

          0         1
0  0.059667  0.940333
1  0.861107  0.138893
2      0.26      0.74
3  0.828095  0.171905
4      0.08      0.92
5      0.02      0.98
6      0.19      0.81
7      0.43      0.57
8      0.65      0.35
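
To turn these probabilities into a yes/no label, one option is to threshold column `'1'`. The sketch below copies the values from the output above and assumes that class 1 means "deeptech" and that 0.5 is an acceptable cut-off; both are assumptions, as is the column name `is_deeptech`.

```python
import pandas as pd

# Probabilities copied from the output above
y_pred = pd.DataFrame({
    "0": [0.059667, 0.861107, 0.26, 0.828095, 0.08, 0.02, 0.19, 0.43, 0.65],
    "1": [0.940333, 0.138893, 0.74, 0.171905, 0.92, 0.98, 0.81, 0.57, 0.35],
})

# Assumed cut-off: flag a company as deeptech when P(class 1) >= 0.5
THRESHOLD = 0.5
labels = y_pred.assign(is_deeptech=y_pred["1"] >= THRESHOLD)
print(labels)
```

Rows close to the cut-off, such as row 7 at 0.57, are worth reviewing manually rather than trusting the label alone.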


## Scraping and using text data

The input needs to have the following fields:
[TODO]

Then use the following functions:

In [38]:
final_for_NLP.website_url

0             http://verkor.com/
1           https://angell.bike/
2          https://carbios.fr/en
3          http://mastergrid.com
4             https://pasqal.io/
5           https://gourmey.com/
6    http://www.epigenelabs.com/
7         https://spacesense.co/
8         https://www.kraaft.co/
Name: website_url, dtype: object

## Using the NLP model to predict whether the company is deeptech

## Making the two predictions "vote" for final prediction

## Note on retraining a model with more labelled data
When you have labelled new data as deep_tech or not and want to train a new version of the model on more data:

+ To get the data from Dealroom if you have the Dealroom IDs, you can use the function written by the former Wagon team:
    + List the Dealroom IDs of all the companies you want to analyse in three separate CSV files, one per classification (deeptech, non_deeptech, almost_deeptech), and save these three files in the folder "data".
    + Import and use the function `getdata.getfulldata()` (former Wagon team function) to get the new companies' data from Dealroom and save the CSV in the local folder `bpideep/rawdata`.
+ Because of the short time we had, the imputing of growth_stage and launch_year was handled outside of the pipeline, during dataset preparation. We suggest integrating the imputing of these fields, and possibly others, into the pipeline so that any dataset of new observations can be handled by the pipeline directly.
+ You may need the function GetCleanData.get_clean_data(), which encompasses several steps we did manually in this notebook (creating the health_industry column, etc.). This function is tailored to the dataset as we had it in December 2020, so you will probably need to adapt it a bit.
+ To use the function GetCleanData.get_clean_data(), don't forget to save the CSV files (for the patents and LinkedIn data) in the folder "data", and to update the CSV names in the function if yours differ.
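
The suggestion above about moving the growth_stage and launch_year imputing into the pipeline could be sketched as follows with scikit-learn's `ColumnTransformer`. This is a sketch under assumptions, not the existing model: the imputation strategies, the toy data and labels, and the `DecisionTreeClassifier` placeholder are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

numeric_cols = ["launch_year"]
categorical_cols = ["growth_stage"]

# Impute each column type inside the pipeline instead of in notebook cells
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Placeholder classifier (assumption): swap in the estimator you actually train
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# Toy training data with gaps (made-up values and labels, for illustration only)
X_train = pd.DataFrame({
    "launch_year": [2015, 2018, np.nan, 2020, 2012, 2019],
    "growth_stage": ["seed", "early growth", "seed", np.nan, "late growth", "seed"],
})
y_train = [1, 0, 1, 0, 1, 0]
model.fit(X_train, y_train)

# New observations can now contain NaNs in these two fields
X_new = pd.DataFrame({
    "launch_year": [np.nan, 2017],
    "growth_stage": [np.nan, "seed"],
})
proba = model.predict_proba(X_new)
print(proba)
```

With the imputers fitted as part of the pipeline, the fill values are learned from the training data and saved in the same pickle file as the model.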