# How are the technological hubs of manufacturing evolving in the world? An analysis on the rate of innovation. 
<h2 align="center"> Pre-Work</h2>

----

### Description of Section: Data cleaning and initial analysis

In this first part of the project, I assess the quality of the data retrieved and saved in the [prework section](./01_retrieving_data_and_DB_creation.ipynb) so that it can be used for further analysis. 
This jupyter notebook includes the following steps:
1. Reading the files
2. Using basic pandas techniques to assess the quality of the data
3. Exporting the data to CSV format. 
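
The quality check in step 2 can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `quality_report` and the small in-memory frame are assumptions made for the example.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, share of missing values and cardinality per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,
        "n_unique": df.nunique(),
    })


# Hypothetical stand-in for a file loaded with pd.read_csv
example = pd.DataFrame({
    "country": ["AT", "AT", None, "DE"],
    "patents": [10, 12, 7, None],
})

report = quality_report(example)
print(report)
```

A report like this gives one row per column, which makes it easy to see which columns need cleaning before export.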

### The Data:

I will be using a number of different data sources for the two different questions asked in the project description. 

##### 1. Innovation analysis 
For the first part of my project, **the innovation analysis**, I will use patent registration data, both [by country]() and [by company](). In addition, I will use a number of indicators provided by the [World Bank Dataset]() to measure innovation. 

Some of the datasets used, such as the Industry Classification Benchmark indices, were imported from a self-made CSV file. The information was extracted from [Wikipedia](https://en.wikipedia.org/wiki/Industry_Classification_Benchmark).


##### 2. Market analysis
In the second part of the project, I will attempt a **market analysis** using data from [Crunchbase]() and the patent registration database [JRC-OECD COR&DIP database v.1](), 2017. 


----


In [1]:
# importing basic libraries needed: 

# for database manipulation
import pandas as pd
import pandas_profiling as pp
import numpy as np
import os
from datetime import  datetime

# for visualisations:
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

# for statistics
from scipy import stats

## Basis:
First, the datasets that contain the primary keys are created and modified to maintain a consistent nomenclature throughout the project. 

This includes the country dataframe and the industry dataframe. 

The country dataframe is based on the World Bank report and includes:
- Country Code
- Country Name
- Income Group

The industry table, as mentioned above, comes from the Industry Classification Benchmark indices. 

##### 1. Countries table:

In [236]:
country = pd.read_csv("../00_data/01_raw/patent_data/world_bank_countires.csv", index_col="Country Code")

In [237]:
country.head()

Unnamed: 0_level_0,Country Name,Region,IncomeGroup,SpecialNotes
Country Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ABW,Aruba,Latin America & Caribbean,High income,
AFG,Afghanistan,South Asia,Low income,
AGO,Angola,Sub-Saharan Africa,Lower middle income,
ALB,Albania,Europe & Central Asia,Upper middle income,
AND,Andorra,Europe & Central Asia,High income,


In [238]:
# drop SpecialNotes as it's not useful 
country.drop("SpecialNotes", axis=1, inplace=True)

In [241]:
# change names of columns for consistency:
country = country.rename(columns = {"Country Name":"country name", "Region":"region", "IncomeGroup":"income_group"})

In [242]:
# save for later use











##### 2. Industries table:

In [227]:
# industry in ICB (Industry Classification Benchmark) format, needs to be changed to be easier to understand 
Industry = pd.read_csv("../00_data/01_raw/patent_data/ICBs.csv", index_col="ICB")
Industry.head()

Unnamed: 0_level_0,industry
ICB,Unnamed: 1_level_1
530,Oil & Gas Producers
570,"Oil Equipment, Services & Distribution"
580,Alternative Energy
1350,Chemicals
1730,Forestry & Paper


In [229]:
# reset index for future merge on ICB column
industry = Industry.reset_index()

Unnamed: 0,ICB,industry
0,530,Oil & Gas Producers
1,570,"Oil Equipment, Services & Distribution"
2,580,Alternative Energy
3,1350,Chemicals
4,1730,Forestry & Paper


In [None]:
# save to clean data Industry table:











## 1. Innovation analysis

##### Patents by country

In [150]:
patents_by_country = pd.read_csv("../00_data/01_raw/patent_data/patents_by_country_and_technology.csv")

In [152]:
patents_by_country.head()

Unnamed: 0,Origin,Origin (Code),Field of technology,1980,1981,1982,1983,1984,1985,1986,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Afghanistan,AF,Unknown,,,,,,,,...,,,,,,,,,,1.0
1,Afghanistan,AF,"1 - Electrical machinery, apparatus, energy",,,,,,,,...,,,,,,1.0,1.0,,,
2,Afghanistan,AF,2 - Audio-visual technology,,,,,,,,...,1.0,,,,,,,,,
3,Afghanistan,AF,5 - Basic communication processes,,,,,,,,...,,,,,,,,1.0,,
4,Afghanistan,AF,6 - Computer technology,,,,,,,,...,1.0,,1.0,2.0,2.0,1.0,5.0,6.0,5.0,5.0


In [160]:
# fill nans with 0 to complete table, the rest of the table will be left like this to use in further analysis
patents_by_country = patents_by_country.fillna(0)

In [165]:
# change names of columns for consistency 
patents_by_country = patents_by_country.rename(columns = {"Origin":"country name", "Origin (Code)":"country", "Field of technology":"industry"})

In [None]:
# change country indexing to 3 letter naming for consistency:







# make into dictionary and replace?
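
One way to follow up on the dictionary idea above is sketched below. It assumes a lookup from two-letter to three-letter ISO 3166-1 codes; only a short hypothetical excerpt of that lookup is shown, and the sample frame and the code `XX` are made up for illustration.

```python
import pandas as pd

# Hypothetical excerpt of an ISO 3166-1 alpha-2 -> alpha-3 lookup;
# a complete table would have to be loaded from a reference file.
iso2_to_iso3 = {"AD": "AND", "AE": "ARE", "AF": "AFG", "AG": "ATG"}

sample = pd.DataFrame({"country": ["AF", "AD", "XX"], "2017": [100.0, 5.0, 1.0]})

# map() leaves codes that are missing from the dictionary as NaN,
# which makes unmapped countries easy to spot
sample["country_iso3"] = sample["country"].map(iso2_to_iso3)
unmapped = sample.loc[sample["country_iso3"].isna(), "country"].unique()
print(unmapped)
```

Using `map` rather than `replace` means a code missing from the lookup shows up as NaN instead of silently keeping its two-letter form.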








In [233]:
# changing the industry label for consistency:
patents_by_country["industry"].value_counts()

# industry naming is not consistent with current naming, nor is it translatable. For the moment we are going to leave it like this.

35 - Civil engineering                         161
33 - Furniture, games                          159
29 - Other special machines                    156
32 - Transport                                 156
27 - Engines, pumps, turbines                  154
1 - Electrical machinery, apparatus, energy    153
16 - Pharmaceuticals                           152
19 - Basic materials chemistry                 151
23 - Chemical engineering                      149
13 - Medical technology                        149
25 - Handling                                  144
18 - Food chemistry                            143
10 - Measurement                               143
6 - Computer technology                        143
34 - Other consumer goods                      143
Unknown                                        141
26 - Machine tools                             139
14 - Organic fine chemistry                    139
12 - Control                                   138
30 - Thermal processes and appa

In [168]:
# grouping by country for sum of patents per country:
patents_grouped = patents_by_country.groupby("country").sum()
patents_grouped.head()

Unnamed: 0_level_0,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,2.0,4.0,2.0,9.0,16.0,10.0,9.0,4.0,2.0
AD,1.0,2.0,2.0,1.0,2.0,1.0,3.0,4.0,1.0,3.0,...,7.0,15.0,14.0,3.0,15.0,6.0,18.0,6.0,12.0,5.0
AE,1.0,2.0,2.0,2.0,1.0,1.0,0.0,2.0,1.0,3.0,...,20.0,23.0,29.0,37.0,54.0,65.0,81.0,87.0,139.0,180.0
AF,0.0,0.0,1.0,2.0,3.0,0.0,1.0,2.0,1.0,7.0,...,73.0,37.0,29.0,17.0,22.0,49.0,54.0,84.0,79.0,100.0
AG,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,2.0,0.0,0.0,4.0,4.0,2.0,3.0,4.0,1.0
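
The row labelled `0` in the grouped table above suggests that some rows had a missing country code, which `fillna(0)` then turned into the label `0`. One possible cause (an assumption, not verified against the raw file) is that pandas reads the string `NA`, the two-letter code for Namibia, as a missing value by default. A minimal sketch on made-up data of reading with `keep_default_na=False` and filling only the year columns:

```python
import io
import pandas as pd

# Made-up CSV excerpt; "NA" is Namibia's two-letter code
raw = io.StringIO(
    "Origin,Origin (Code),1980,1981\n"
    "Namibia,NA,,2\n"
    "Andorra,AD,1,\n"
)

# keep_default_na=False stops pandas from treating "NA" as missing;
# na_values=[""] still marks empty cells as NaN
df = pd.read_csv(raw, keep_default_na=False, na_values=[""])

# fill only the year columns so identifier columns are never overwritten
year_cols = [c for c in df.columns if c.isdigit()]
df[year_cols] = df[year_cols].fillna(0)
print(df)
```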


In [None]:
# save data for future use











#####  Innovation rate of countries

In [169]:
innovation_markers = pd.read_csv("../00_data/01_raw/patent_data/world_bank_innovation_rate.csv")

In [170]:
innovation_markers.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Aruba,ABW,High-technology exports (% of manufactured exp...,TX.VAL.TECH.MF.ZS,,,,,,,...,3.213858,3.292313,4.052655,10.2401,4.915729,5.449211,4.694768,3.758736,5.380172,
1,Aruba,ABW,High-technology exports (current US$),TX.VAL.TECH.CD,,,,,,,...,866862.0,560538.0,1073498.0,3325655.0,1627152.0,1663353.0,1194859.0,1158612.0,1654810.0,
2,Aruba,ABW,Technicians in R&D (per million people),SP.POP.TECH.RD.P6,,,,,,,...,,,,,,,,,,
3,Aruba,ABW,Researchers in R&D (per million people),SP.POP.SCIE.RD.P6,,,,,,,...,,,,,,,,,,
4,Aruba,ABW,"Trademark applications, total",IP.TMK.TOTL,,,,,,,...,,,,,,,,,,


In [171]:
# fill nans with 0 to complete table
innovation_markers = innovation_markers.fillna(0)

In [172]:
# remove unnecessary column: indicator code
innovation_markers.drop("Indicator Code", axis=1, inplace=True)

In [173]:
# change names of columns for consistency 
innovation_markers = innovation_markers.rename(columns = {"Country Name":"country name", "Country Code":"country", "Indicator Name":"indicator"})

In [179]:
# see if more work on this is needed!!
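
One option for further work here, offered as a suggestion rather than something the notebook does, is to reshape the indicator table from one column per year to one row per country, indicator and year. The sketch below uses a small excerpt of the values shown above and assumes the column names produced by the rename. It would have to run before `fillna(0)` for the NaN handling to matter.

```python
import pandas as pd

wide = pd.DataFrame({
    "country name": ["Aruba", "Aruba"],
    "country": ["ABW", "ABW"],
    "indicator": ["High-technology exports (current US$)",
                  "Trademark applications, total"],
    "2016": [1158612.0, None],
    "2017": [1654810.0, None],
})

id_cols = ["country name", "country", "indicator"]
long_df = wide.melt(id_vars=id_cols, var_name="year", value_name="value")
long_df["year"] = long_df["year"].astype(int)

# dropping NaN rows keeps "not reported" distinct from a true zero
long_df = long_df.dropna(subset=["value"]).reset_index(drop=True)
print(long_df)
```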












## 2. Company analysis - market research

### Patents per company
Database documentation can be found in pdf format in the data folder

##### Company list

In [137]:
company_lst = pd.read_csv("../00_data/01_raw/patent_data/2017-COR&DIP_Company_list.txt", sep="|")

In [138]:
company_lst.head()

Unnamed: 0,Company_id,Company_name,Ctry_Code,Worldrank,ICB3,NACE2,ISIC4_STAN38
0,1,ZUMTOBEL,AT,851,2730,2740,27
1,2,ANDRITZ,AT,892,2750,2895,28
2,3,AUSTRIAMICROSYSTEMS,AT,1023,9570,2611,26
3,4,AUSTRIA TECHNOLOGIE & SYSTEMTECHNIK,AT,1173,2730,2612,26
4,5,VOESTALPINE,AT,596,1750,2452,24-25


Company_id: Unique company identifier (from 1 to 2000)

Company_name: Company name, as listed in the 2015 Scoreboard

Ctry_code: ISO2 country code

Worldrank: From 1 to 2000, as ranked in the 2015 Scoreboard

ICB-3D: Industry sector, as listed in the 2015 Scoreboard


In [139]:
# drop unnecessary columns
company_lst.drop(["NACE2", "ISIC4_STAN38"], axis=1, inplace=True)

In [140]:
# make column names more understandable 
company_lst = company_lst.rename(columns = {"Company_name":"company", "Ctry_Code":"country", "Worldrank":"worldrank", "ICB3":"ICB"})

In [141]:
company_lst.head()

Unnamed: 0,Company_id,company,country,worldrank,ICB
0,1,ZUMTOBEL,AT,851,2730
1,2,ANDRITZ,AT,892,2750
2,3,AUSTRIAMICROSYSTEMS,AT,1023,9570
3,4,AUSTRIA TECHNOLOGIE & SYSTEMTECHNIK,AT,1173,2730
4,5,VOESTALPINE,AT,596,1750


In [143]:
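# merge in the ICB industry name from the industry table created above, then set Company_id as index
# (the ICB column is kept for now; the drop below is left commented out)
company_lst = company_lst.merge(industry, how="left", on="ICB").set_index("Company_id")
#company_lst.drop("ICB", axis=1, inplace=True)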
company_lst.head()

Unnamed: 0_level_0,company,country,worldrank,ICB,industry
Company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,ZUMTOBEL,AT,851,2730,Electronic & Electrical Equipment
2,ANDRITZ,AT,892,2750,Industrial Engineering
3,AUSTRIAMICROSYSTEMS,AT,1023,9570,Technology Hardware & Equipment
4,AUSTRIA TECHNOLOGIE & SYSTEMTECHNIK,AT,1173,2730,Electronic & Electrical Equipment
5,VOESTALPINE,AT,596,1750,Industrial Metals & Mining
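
A left merge keeps companies whose ICB code has no match in the industry table and leaves their `industry` empty. A minimal sketch of how this could be checked, using made-up frames (the names `companies` and `icb_names` and the code `9999` are hypothetical):

```python
import pandas as pd

companies = pd.DataFrame({"Company_id": [1, 2, 3], "ICB": [2730, 2750, 9999]})
icb_names = pd.DataFrame({
    "ICB": [2730, 2750],
    "industry": ["Electronic & Electrical Equipment", "Industrial Engineering"],
})

# validate="m:1" raises if an ICB code appears twice in the lookup table;
# indicator=True adds a _merge column showing where each row was found
checked = companies.merge(icb_names, how="left", on="ICB",
                          validate="m:1", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "ICB"].tolist()
print(unmatched)
```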


In [144]:
# as we will later on merge in the count of patents per company, we do not yet save to csv.

##### Company patents

In [49]:
company_patents = pd.read_csv("../00_data/01_raw/patent_data/2017-COR&DIP_Patent_Portfolio.txt", sep="|", index_col=0)



In [50]:
company_patents.head()

Unnamed: 0_level_0,Patent_appln_id,Publn_auth,Patent_publn_nr,Patent_filing_date,Inpadoc_family_id,Family_filing_date,IP5_2_offices
Company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,380646999,EP,2805106,2013-01-14,9215041,2012-01-16,0
1,380718081,EP,2807419,2013-01-22,9222565,2012-01-26,0
1,380385005,EP,2620923,2013-01-22,9222808,2012-01-27,0
1,380385015,EP,2620936,2013-01-22,9222811,2012-01-24,0
1,380889366,EP,2810535,2013-02-01,9232760,2012-02-03,1


In [51]:
# remove unnecessary columns
lst = ["Patent_publn_nr", "Family_filing_date", "IP5_2_offices", "Inpadoc_family_id"]
company_patents.drop(lst, axis=1, inplace=True)

Company_id: Unique company identifier

Patent_appln_id: Patent application identifier (Appln_id from PATSTAT, Autumn 2016)

Publn_auth: IP5 Offices (EP, JP, KR, US, CN)


Patent_filing_date: Application date

Inpadoc_family_id: Patent family identifier (from PATSTAT, Autumn 2016)

In [52]:
# make column names more understandable
company_patents = company_patents.rename(columns = {"Publn_auth":"office", "Patent_filing_date":"date"})

In [53]:
company_patents.head()

Unnamed: 0_level_0,Patent_appln_id,office,date
Company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,380646999,EP,2013-01-14
1,380718081,EP,2013-01-22
1,380385005,EP,2013-01-22
1,380385015,EP,2013-01-22
1,380889366,EP,2013-02-01


In [184]:
# checking attributes
company_patents.dtypes

Patent_appln_id     int64
office             object
date               object
dtype: object

In [187]:
# changing date into datetime (the dates are stored as YYYY-MM-DD, see the table above)
company_patents["date"] = pd.to_datetime(company_patents["date"], format="%Y-%m-%d")

In [54]:
# save as cleaned file for later use
#company_patents.to_csv("../00_clean_datasets/company_patents.csv")














In [55]:
# count of number of patents per office and company
# only one column is needed as we do a count 
# the count is going to be merged into the company_lst table to have all data in one set
counts = company_patents.groupby(["Company_id", "office"]).count()[["Patent_appln_id"]]
counts = counts.rename(columns = {"Patent_appln_id":"count"})
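
If one row per company is preferred for the later merge, the counts could also be pivoted so that each office becomes a column. This is an optional alternative shown on made-up data, not what the notebook does.

```python
import pandas as pd

patents = pd.DataFrame({
    "Company_id": [1, 1, 1, 2],
    "office": ["EP", "EP", "US", "CN"],
    "Patent_appln_id": [101, 102, 103, 104],
}).set_index("Company_id")

# size() counts rows per group without needing to pick a column
counts = patents.groupby(["Company_id", "office"]).size().rename("count")

# unstack moves the office level into columns; absent pairs become 0
wide_counts = counts.unstack("office", fill_value=0)
print(wide_counts)
```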

In [56]:
counts.head()
# this will be merged into the company list dataframe to obtain one table with all the information for further manipulation of the set. 

Unnamed: 0_level_0,Unnamed: 1_level_0,count
Company_id,office,Unnamed: 2_level_1
1,CN,84
1,EP,263
1,JP,1
1,US,86
2,CN,74


##### Merging into one dataset for next analysis

In [145]:
# merge the count list of patents per office onto the company list. 
company_lst = company_lst.merge(counts, how="inner", left_index=True, right_index=True)

In [146]:
company_lst.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,company,country,worldrank,ICB,industry,count
Company_id,office,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,CN,ZUMTOBEL,AT,851,2730,Electronic & Electrical Equipment,84
1,EP,ZUMTOBEL,AT,851,2730,Electronic & Electrical Equipment,263
1,JP,ZUMTOBEL,AT,851,2730,Electronic & Electrical Equipment,1
1,US,ZUMTOBEL,AT,851,2730,Electronic & Electrical Equipment,86
2,CN,ANDRITZ,AT,892,2750,Industrial Engineering,74


In [148]:
# saving cleaned table for later
#company_lst.to_csv("../00_clean_datasets/company_list.csv")


















##### Company finance (extra information for later)

In [188]:
company_info = pd.read_csv("../00_data/01_raw/patent_data/2017-COR&DIP_Company_financial.txt", sep="|", index_col=0)

In [189]:
company_info.head()

Unnamed: 0_level_0,Year,RD,NS,CAPEX,OP,EMP
Company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,2011,55.071,1280.312,57.159,34.591,7456.0
1,2012,66.926,1243.616,59.509,21.659,7162.0
1,2013,71.8,1246.831,65.553,12.144,7291.0
1,2014,89.739,1312.62,76.576,41.091,7234.0
2,2011,65.641,4595.993,76.974,263.445,16750.0


Year: 2011-2014

RD: Research and Development investment (million €)

NS: Net sales (million €)

CAPEX: Capital expenditure (million €)

OP: Operating profits (million €)

EMP: Number of employees

In [190]:
# changing the names of variables for better understanding 
company_info = company_info.rename(columns = {"RD":"R&D_investment", "NS":"net_sales", "OP":"operating_profits", "EMP":"employees"})

In [191]:
# removing CAPEX as not really needed:
company_info.drop("CAPEX", axis=1, inplace=True)

In [192]:
# saving cleaned table for later
#company_info.to_csv("../00_clean_datasets/company_information.csv")












### BONUS: Organizations, financing and descriptions  (from 2015) 

In [193]:
organizations = pd.read_csv("../00_data/01_raw/crunchbase15/companies_15.csv")

In [194]:
organizations.head()

Unnamed: 0,name,category_list,status,country_code,city,founded_at,last_funding_at
0,#fame,Media,operating,IND,Mumbai,,05/01/2015
1,:Qounter,Application Platforms|Real Time|Social Network...,operating,USA,Delaware City,04/09/2014,14/10/2014
2,"(THE) ONE of THEM,Inc.",Apps|Games|Mobile,operating,,,,30/01/2014
3,0-6.com,Curated Web,operating,CHN,Beijing,01/01/2007,19/03/2008
4,004 Technologies,Software,operating,USA,Champaign,01/01/2010,24/07/2014


###### Checking for NaNs

In [195]:
# removing companies with NaNs, no matter what fraction of the data we lose
# we need all attributes but location to be filled for the data to be of use. 
# (we will be sorting by category and founding date) 
organizations.isna().sum()/len(organizations)*100

name                0.001507
category_list       4.743250
status              0.000000
country_code       10.483968
city               12.096191
founded_at         22.934245
last_funding_at     0.000000
dtype: float64

In [196]:
# we will drop all NaNs except those in the location attribute
# we will drop city as we do not need it; country is enough

organizations.drop("city", axis=1, inplace=True)

# create df for location to merge back in after drop nan
location = organizations[["country_code"]]

# drop country from original
organizations.drop("country_code", axis=1, inplace=True)

# drop nans 
organizations.dropna(inplace=True)

# merge countries back in
organizations = organizations.merge(location, how="left", left_index=True, right_index=True)
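
The same result can probably be reached in one step with the `subset` argument of `dropna`, which limits the NaN check to the listed columns. A minimal sketch on made-up rows, assuming `city` has already been dropped:

```python
import pandas as pd

orgs = pd.DataFrame({
    "name": ["a", "b", None, "d"],
    "category_list": ["Software", None, "Media", "Mobile"],
    "founded_at": ["01/01/2010", "01/01/2011", "01/01/2012", "01/01/2013"],
    "country_code": ["USA", "CHN", "IND", None],
})

# check every column except the location for missing values
required = [c for c in orgs.columns if c != "country_code"]
cleaned = orgs.dropna(subset=required)
print(cleaned)
```

Unlike the split-and-merge approach, this keeps `country_code` in its original column position.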

In [197]:
organizations.isna().sum()

name                  0
category_list         0
status                0
founded_at            0
last_funding_at       0
country_code       3183
dtype: int64

##### checking dataframe attributes and types

In [198]:
print(organizations.shape)

(49710, 6)


In [199]:
organizations.head()

Unnamed: 0,name,category_list,status,founded_at,last_funding_at,country_code
1,:Qounter,Application Platforms|Real Time|Social Network...,operating,04/09/2014,14/10/2014,USA
3,0-6.com,Curated Web,operating,01/01/2007,19/03/2008,CHN
4,004 Technologies,Software,operating,01/01/2010,24/07/2014,USA
6,Ondine Biomedical Inc.,Biotechnology,operating,01/01/1997,21/12/2009,CAN
7,H2O.ai,Analytics,operating,01/01/2011,09/11/2015,USA


In [200]:
# checking df attributes
organizations["category_list"].value_counts()

Software                                                                                                     3137
Biotechnology                                                                                                2431
E-Commerce                                                                                                    999
Mobile                                                                                                        883
Curated Web                                                                                                   794
Clean Technology                                                                                              731
Hardware + Software                                                                                           699
Enterprise Software                                                                                           663
Health Care                                                                             

In [201]:
organizations.dtypes

name               object
category_list      object
status             object
founded_at         object
last_funding_at    object
country_code       object
dtype: object

In [None]:
# changing last_funding_at into datetime
organizations["last_funding_at"] = pd.to_datetime(organizations["last_funding_at"], format="%d/%m/%Y")

In [223]:
# check for inconsistent date format
organizations[organizations["founded_at"]=="1899-12-31"]
# drop observations that fall under education (not in our main interest)


Unnamed: 0,name,category_list,status,founded_at,last_funding_at,country_code
1879,AG&P,Clean Technology,operating,1899-12-31,2013-07-02,PHL
6153,Becker College,Education,operating,1899-12-31,2013-09-13,USA
9898,Carnegie Mellon University,Education,operating,1899-12-31,2014-09-02,USA
60561,University of Chicago,Education,operating,1899-12-31,2014-01-06,USA


In [243]:
#changing founded_at into datetime
#organizations["founded_at"] = pd.to_datetime(organizations["founded_at"], format="%d/%m/%Y")
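
Because `founded_at` mixes `DD/MM/YYYY` values with a few `YYYY-MM-DD` ones, a strict conversion would fail on the latter. One possible approach, sketched on three values taken from the outputs above, is to coerce non-matching values to `NaT` and inspect them separately:

```python
import pandas as pd

founded = pd.Series(["04/09/2014", "01/01/2007", "1899-12-31"])

# errors="coerce" turns values that do not match the format into NaT
parsed = pd.to_datetime(founded, format="%d/%m/%Y", errors="coerce")

# rows that failed to parse can then be inspected or dropped
bad_rows = founded[parsed.isna()]
print(bad_rows.tolist())
```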











## 3. Organising Database


### Description of section: 
- Importing Datasets
- Organising Database
- Creating Database connection and exporting Datasets to cloud


The prework of the project consists mainly of organising the data storage and creating the tables that I will be using throughout this analysis. 

The goal of this part is to describe the database, provide the credentials needed to access it as a guest, and upload the data to the cloud. 

I will be using a number of different datasets for the analysis of the project.

The links and organisation between these can be seen below: 

![DB_diagram](../02_visualisations/Database_org.PNG)

> Note that the tables country and industry are only used as key-holders for extra information. I will not be analysing the information in them directly. 

## 3. Exporting data to the cloud

In [None]:
# tables to export:
# - industry (ICB)
# - innovation_markers (World Bank)
# - organizations (Crunchbase)
# - country (World Bank)
# - company_lst (?)
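
The export could look roughly like the sketch below. The notebook does not show its database driver or connection code, so an in-memory SQLite database stands in for the cloud database here, and the stand-in frames hold only a few rows copied from the outputs above.

```python
import sqlite3
import pandas as pd

# Stand-in frames; in the notebook these would be the cleaned tables
tables = {
    "industry": pd.DataFrame({
        "ICB": [530, 570],
        "industry": ["Oil & Gas Producers",
                     "Oil Equipment, Services & Distribution"],
    }),
    "country": pd.DataFrame({
        "country": ["ABW", "AFG"],
        "income_group": ["High income", "Low income"],
    }),
}

# Assumption: SQLite in memory replaces the real cloud connection
conn = sqlite3.connect(":memory:")
for name, frame in tables.items():
    frame.to_sql(name, conn, if_exists="replace", index=False)

# read one table back to confirm the rows arrived
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM industry", conn)["n"].iloc[0]
print(row_count)
conn.close()
```

Keeping the tables in a dictionary keyed by table name means the same loop would work unchanged once the real connection object is substituted.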

##### DataBase guest credentials:

<div class="alert alert-block alert-info">
<b>Credentials for DB:</b> 

##### To access the database, see the details below:
Note that this database has an IP restriction and can therefore only be accessed from the IRONHACK campus. If the DB needs to be accessed from a different IP address, please contact me. 

>**Connection information:** <br>
>**User name**: ironhack <br> 
>**Password**: Ironhack1 <br>
>**Host name**: 35.240.116.117 <br>
</div>