In [1]:
%load_ext autoreload
%autoreload 2

## Preparation

### Package loading

(To do once per environment)
Clone the package from https://github.com/ClemenceK/bpifrance_deeptech_analysis
(You probably have already done it if you have this notebook!)

In [None]:
! pip install ..
# Alternatively, in a terminal, move to the project folder (one level above this notebook, where setup.py is)
# and install from there with: pip install .

Clone the deep4deep package into the same parent folder, from: https://github.com/ClemenceK/deep4deep

In [None]:
# install the deep4deep package from here…
! pip install ../../deep4deep/
# Alternatively, in a terminal, move to that project folder (where setup.py is)
# and install from there with: pip install .

In [None]:
# Quick check that the import worked
from deep4deep.utils import simple_time_tracker

@simple_time_tracker
def test():
    print("It works if something like 'test 0.0' (or another figure) is printed below")
    
test()

### Local files preparation

In the bpifrance_deeptech_analysis project folder (where setup.py is), create a .env file containing the following line, replacing your_key with your own key:

DEALROOMAPIKEY= 'your_key'

(This keeps your key off GitHub: .env should be listed in .gitignore so that it is never uploaded.)
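
As an illustration of how such a file can be read, here is a minimal sketch using only the standard library. The helper name `read_env_file` is hypothetical and is not part of bpifrance_deeptech_analysis, whose own key-loading code is not shown in this notebook. The sketch assumes one KEY=value pair per line and tolerates spaces and quotes around the value.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=value lines into a dict (hypothetical helper, not the package's loader)."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not a KEY=value pair
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace, then surrounding single or double quotes
        values[key.strip()] = value.strip().strip("'\"")
    return values
```

With the file described above, `read_env_file(".env")["DEALROOMAPIKEY"]` would give back the key without its quotes.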

In bpifrance_deeptech_analysis, create local data and raw_data folders, containing:
[TODO if needed]

## Building the base data from dealroom

1. Get data from Dealroom with the functions written by the former Wagon team, either by:

- ID: list the Dealroom IDs of all the companies you want to analyse in three separate csv files, one per company classification (deeptech, non_deeptech, almost_deeptech), and save these three csv files in the folder "data". 
Then use the function getdata.getfulldata() (written by the former Wagon team) to get the new companies' data from Dealroom, and save the resulting csv in the folder "rawdata".

- name: use the function "company_search" (this step remains to be automated).

In [2]:
from bpideep.getdata import company_search
import pandas as pd

Code to get Dealroom data for our 9 companies: run it only if you need to fetch new data, then save the result.

new = ["verkor", "angell", "carbios","mastergrid","pasqal","gourmey", "Epigene Labs","SpaceSense","Kraaft"]

# Query Dealroom once per company name and stack the results into a single DataFrame
df = pd.concat([company_search(company) for company in new], ignore_index=True)

# The next cell reads the file from ../bpideep/rawdata/, so move demo_data.csv there after saving
df.to_csv("demo_data.csv", index=False)

In [3]:
new_data = pd.read_csv("../bpideep/rawdata/demo_data.csv")

2. Select the needed columns to make the data analysis easier

In [4]:
new_data = new_data[["id", "name", "total_funding_source", "employees",
                 "employees_latest", "launch_year", "growth_stage", "linkedin_url", "industries", "investors"]]

3. Add nb_patents to new_data. 

In our example, as we don't have an extract from Google Patents Search, we simply create a column "nb_patents" filled with 0.

NB: To use the function GetCleanData.get_clean_data(), don't forget to save the csv files (patents and LinkedIn data) in the folder "data", and to update the csv file names in the function if yours differ.

In [5]:
new_data["nb_patents"] = 0  # no patent extract available: fill the whole column with 0

4. Create a new column "age" holding the age of the company, computed from the column "launch_year" (2020 is the reference year)

In [6]:
new_data["age"] = 2020 - new_data.launch_year

5. Create a new feature "investors_type"

In [7]:
from bpideep.GetCleanData import load_json_field, get_health, fund_investors, investors_type

new_data["investors"] = new_data["investors"].apply(load_json_field)
new_data["investors_type"] = pd.DataFrame(new_data["investors"].apply(lambda row: investors_type(row)))
new_data["investors_type"] = fund_investors(new_data[["investors_type"]])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_data["investors_type"] = fund_investors(new_data[["investors_type"]])
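
The warning above is typically raised when a DataFrame obtained by selecting columns from another one is modified afterwards. One common way to avoid it is to take an explicit copy when selecting the columns. Here is a minimal sketch on made-up data (the values are illustrative, not Dealroom data):

```python
import pandas as pd

# Toy data standing in for the Dealroom extract (illustrative values only)
raw = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "extra": [10, 20]})

# .copy() makes the selection an independent DataFrame, so later column
# assignments modify it unambiguously and leave `raw` untouched
subset = raw[["id", "name"]].copy()
subset["nb_patents"] = 0

print(subset.columns.tolist())
print(raw.columns.tolist())
```

In this notebook, the same idea would apply to the cell that selects the needed columns of new_data.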


6. Create a new feature "health_industry"

In [8]:
new_data["health_industry"] = pd.DataFrame(get_health(new_data["industries"]))

7. To use the pipeline, the feature columns must have the names that the pipeline expects:

In [9]:
new_data.rename(columns={"employees_latest": "employees_clean", "launch_year": "launch_year_clean", "growth_stage": "growth_stage_imputed"})

| | id | name | total_funding_source | employees | employees_clean | launch_year_clean | growth_stage_imputed | linkedin_url | industries | investors | nb_patents | age | investors_type | health_industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1985985 | Verkor | 0 | 2-10 | 9.0 | 2020 | seed | https://www.linkedin.com/company/verkor/ | [{'id': 100023, 'name': 'energy'}] | {'items': [{'id': 869605, 'name': 'EIT InnoEne... | 0 | 0 | 1 | 0 |
| 1 | 1841152 | Angell | 10000000 | 11-50 | 25.0 | 2018 | early growth | https://www.linkedin.com/company/angell | [{'id': 100111, 'name': 'transportation'}] | {'items': [{'id': 1476722, 'name': 'Groupe SEB... | 0 | 2 | 0 | 0 |
| 2 | 924274 | Carbios | 7400000 | 11-50 | 37.0 | 2011 | early growth | https://www.linkedin.com/company/carbios | [{'id': 100023, 'name': 'energy'}] | {'items': [{'id': 24770, 'name': 'Truffle Capi... | 0 | 9 | 1 | 0 |
| 3 | 2434315 | Mastergrid | 0 | 51-200 | NaN | 2019 | late growth | https://www.linkedin.com/company/mastergrid/ | [] | {'items': [], 'total': 0} | 0 | 1 | 0 | 0 |
| 4 | 1685632 | Pasqal | 0 | 11-50 | 14.0 | 2019 | early growth | https://www.linkedin.com/company/pasqal | [{'id': 100120, 'name': 'semiconductors'}] | {'items': [{'id': 1218398, 'name': 'Quantonati... | 0 | 1 | 1 | 0 |
| 5 | 1769618 | Gourmey | 50000 | 11-50 | 18.0 | 2019 | early growth | https://www.linkedin.com/company/gourmey | [{'id': 100008, 'name': 'food'}] | {'items': [{'id': 871041, 'name': 'European In... | 0 | 1 | 1 | 0 |
| 6 | 1757959 | Epigene Labs | 1400000 | 11-50 | 11.0 | 2019 | early growth | https://www.linkedin.com/company/epigene-labs | [{'id': 1254, 'name': 'health'}] | {'items': [{'id': 885471, 'name': 'Agoranov', ... | 0 | 1 | 1 | 1 |
| 7 | 1814695 | SpaceSense | 0 | 2-10 | 4.0 | 2018 | seed | https://www.linkedin.com/company/spacesense-co | [] | {'items': [], 'total': 0} | 0 | 2 | 0 | 0 |
| 8 | 1570787 | Kraaft | 0 | 11-50 | 12.0 | 2019 | early growth | https://www.linkedin.com/company/kraaft-co | [{'id': 100147, 'name': 'enterprise software'}] | {'items': [{'id': 965241, 'name': 'OPEO Startu... | 0 | 1 | 1 | 0 |
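
Note that `DataFrame.rename` returns a new DataFrame by default and leaves the original unchanged. The rename cell above displays the renamed frame without assigning it, so new_data keeps its Dealroom column names unless the result is assigned back. A small sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"employees_latest": [9.0], "launch_year": [2020]})
mapping = {"employees_latest": "employees_clean", "launch_year": "launch_year_clean"}

# Without assignment, the renamed frame is returned but `toy` keeps its columns
toy.rename(columns=mapping)
print(toy.columns.tolist())

# Assigning the result back makes the new names persist
toy = toy.rename(columns=mapping)
print(toy.columns.tolist())
```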


## Scraping and adding the LinkedIn data

The first step is to generate scripts for the web scraper and scrape LinkedIn (shown as a live demo).

In [None]:
# Before calling the function "build_employee_df", the csv files containing the scraped data must be placed
# in a folder 'bpi_deep/scraping_data/companies_people/'
from bpideep.process_scraped_data import build_employee_df, process_employee_data
df_employees= process_employee_data(build_employee_df())

In [None]:
#Example of the content of the employee dataframe after processing
df_employees.head(5)

At this point, generate the scripts for profile scraping on LinkedIn and run the scrape.

In [None]:
# Before calling the function "open_founder_profile_files", the csv files containing the scraped founders data
# must be placed in a folder 'bpi_deep/scraping_data/founders_files/'
from bpideep.process_scraped_data import open_founder_profile_files, inline_profile
from bpideep.process_scraped_data import build_founders_dataframe, generate_founders_features
df_founders_raw = open_founder_profile_files()

In [None]:
# The function "build_founders_dataframe" processes the raw df and returns a df with one line per founder
# The function "generate_founders_features" generates the new relevant features, such as "founder_has_phd"
df_founders = generate_founders_features(build_founders_dataframe(df_founders_raw))

In [None]:
df_founders.head(5)

In [None]:
# Finally, we merge founders to the full employee DF, update the feature ("technical"), and aggregate into companies
from bpideep.process_scraped_data import companies_technical_stats_with_founders_features, update_technical
df_employees_full = update_technical(df_employees, df_founders)
df_companies_stats_with_founders_features = companies_technical_stats_with_founders_features(df_employees_full)
df_companies_stats_with_founders_features.head(5)

In [None]:
# Last optional step: merge the DF with new company features with the dealroom df (partially preprocessed above)
import pandas as pd
from bpideep.process_scraped_data import merge_initial_companies_with_founder
# df_full = pd.read_csv('../bpideep/rawdata/demo_data.csv')
final = merge_initial_companies_with_founder(new_data, df_companies_stats_with_founders_features)

## Checking the NaNs

In [10]:
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    9 non-null      int64  
 1   name                  9 non-null      object 
 2   total_funding_source  9 non-null      int64  
 3   employees             9 non-null      object 
 4   employees_latest      8 non-null      float64
 5   launch_year           9 non-null      int64  
 6   growth_stage          9 non-null      object 
 7   linkedin_url          9 non-null      object 
 8   industries            9 non-null      object 
 9   investors             9 non-null      object 
 10  nb_patents            9 non-null      int64  
 11  age                   9 non-null      int64  
 12  investors_type        9 non-null      object 
 13  health_industry       9 non-null      int64  
dtypes: float64(1), int64(6), object(7)
memory usage: 1.1+ KB


It is always best to have real data, even if approximate and filled in manually. 
The new function GetCleanData.get_clean_data() builds a Dealroom database in which the columns needed by the model are completed with some manual imputations (LinkedIn scraping + manual information collection). 

NB: due to the short time we had, the growth stage imputation was included in GetCleanData.get_clean_data(). It would be better to make it part of the pipeline, so that any dataset of new observations can be handled directly by the pipeline. 

For the other columns, the pipeline can fill in missing data using the mean or the most frequent value in the training set. Empty or NaN cells are allowed in the following fields:
 
+ number of employees (column called "employees_latest" in Dealroom and "employees_clean" in our function)
+ age (created column called "age")
+ nb_patents

+ launch_year (column called "launch_year" in Dealroom and "launch_year_clean" in our function), but only once the growth stage imputation function is moved into the pipeline (not the case so far)
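
As a rough illustration of this kind of imputation, here is a sketch on made-up numbers. It is not the pipeline's actual code, which is stored in the pickled model, and the choice of which column uses the mean or the most frequent value is arbitrary here. The fill values are computed on the training set only, then applied to the new observations:

```python
import pandas as pd

# Made-up training and new data (illustrative values only)
train = pd.DataFrame({"employees_clean": [10.0, 20.0, 30.0], "nb_patents": [0, 0, 3]})
new_obs = pd.DataFrame({
    "employees_clean": [float("nan"), 12.0],
    "nb_patents": [float("nan"), 1.0],
})

# Fill values come from the training set: mean for one column, most frequent value for the other
fill_values = {
    "employees_clean": train["employees_clean"].mean(),
    "nb_patents": train["nb_patents"].mode().iloc[0],
}
imputed = new_obs.fillna(fill_values)
print(imputed)
```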

If you have empty cells in any other fields, the model will throw an error, so make sure to fill them.
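
A quick way to spot such cells before calling the model is sketched below. The helper `columns_needing_values` is hypothetical (it is not part of bpideep), and the tolerated set simply restates the fields listed above, using the renamed column names.

```python
import pandas as pd

# Columns in which missing values are tolerated, as listed in the text
NAN_TOLERATED = {"employees_clean", "age", "nb_patents"}


def columns_needing_values(df, tolerated=NAN_TOLERATED):
    """Return the columns that contain NaN but are not in the tolerated set (hypothetical helper)."""
    has_nan = df.isna().any()
    return [col for col in df.columns if has_nan[col] and col not in tolerated]


# Toy example (illustrative values only)
toy = pd.DataFrame({
    "employees_clean": [9.0, float("nan")],
    "growth_stage_imputed": ["seed", None],
    "age": [0, 1],
})
print(columns_needing_values(toy))
```

An empty list means that every missing value sits in a column the pipeline can impute.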

In our example, all 9 firms have a growth_stage indicated in Dealroom, so we can use the pipeline directly to predict their classification.

## Using the main model to predict whether the company is deeptech

Use the model already trained on our entire dataset (1332 observations):

In [19]:
import dill as pickle
 
# Load the trained pipeline from the pickle file (the with block closes the file afterwards)
with open("../bpideep/bpideepmodelnew.pkl", "rb") as model_file:
    my_pipeline = pickle.load(model_file)

In [None]:
# Predict the Labels using the reloaded Model
y_pred = my_pipeline.predict(final)  
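
To read the result more easily, the predictions can be placed next to the company names. The sketch below uses stand-in values in place of `final["name"]` and `y_pred`, since running the real model requires the pickle file. It assumes that `final` keeps the "name" column, and the column name "prediction" is an arbitrary choice. The meaning of each label value depends on how the model was trained, which this notebook does not state.

```python
import pandas as pd

# Stand-ins for final["name"] and y_pred (illustrative values only)
names = pd.Series(["Company A", "Company B", "Company C"], name="name")
y_pred_example = [1, 0, 1]

# One row per company, with its predicted label alongside
results = pd.DataFrame({"name": names, "prediction": y_pred_example})
print(results)
```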

## Scraping and using text data

The input needs to have the following fields:
[TODO]

Then use the following functions:

## Using the NLP model to predict whether the company is deeptech

## Making the two predictions "vote" for final prediction