# How are the technological hubs of manufacturing evolving in the world? An analysis of the rate of innovation.
<h2 align="center"> Pre-Work</h2>

----

### Description of Section: Data cleaning and initial analysis

In this first part of the project, I aim to assess the quality of the data retrieved and saved in the [prework section](./01_retrieving_data_and_DB_creation.ipynb), so that further analysis can be carried out on it.
This Jupyter notebook includes the following steps:
1. Reading the files
2. Using basic techniques from the pandas library to assess the quality of the data
3. Exporting the data to CSV format
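
The steps above can be sketched on a toy frame. This is a minimal illustration, not the code used in this notebook: `quality_report` is a hypothetical helper, the sample rows are made up, and the export goes to an in-memory buffer rather than a real file.

```python
import io

import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, share of missing values and distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,
        "n_unique": df.nunique(),
    })


# Step 1 stand-in: a tiny made-up frame instead of a file read from disk
sample = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "country_code": ["USA", None, None, "CHN"],
})

# Step 2: basic quality checks
report = quality_report(sample)
print(report)

# Step 3: export to CSV (a buffer here; pass a file path in practice)
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
```

A per-column summary like this makes it easier to decide which columns to drop before exporting.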

### The Data:

I will be using a number of different data sources for the two different questions asked in the project description. 

##### 1. Innovation analysis 
For the first part of my project, **the innovation analysis**, I will use patent registration data together with a number of indicators provided by the World Bank dataset.

##### 2. Market analysis
In the second part of the project, I will attempt a **market analysis** using data from Crunchbase, as well as the patent registration database JRC-OECD COR&DIP v.1, 2017.


----


In [129]:
# importing basic libraries needed: 

# for database manipulation
import pandas as pd
import pandas_profiling as pp
import numpy as np
import os
from datetime import  datetime

# for visualisations:
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

# for statistics
from scipy import stats

## 1. Innovation analysis

## 2. Company analysis - market research

### Patents per company
The database documentation can be found in PDF format in the data folder.

##### company lst

In [23]:
company_lst = pd.read_csv("../00_data/01_raw/patent_data/2017-COR&DIP_Company_list.txt", sep="|", index_col=0)

In [24]:
company_lst.head()

Company_id,Company_name,Ctry_Code,Worldrank,ICB3,NACE2,ISIC4_STAN38
1,ZUMTOBEL,AT,851,2730,2740,27
2,ANDRITZ,AT,892,2750,2895,28
3,AUSTRIAMICROSYSTEMS,AT,1023,9570,2611,26
4,AUSTRIA TECHNOLOGIE & SYSTEMTECHNIK,AT,1173,2730,2612,26
5,VOESTALPINE,AT,596,1750,2452,24-25


Company_id: Unique company identifier (from 1 to 2000)

Company_name: Company name, as listed in the 2015 Scoreboard

Ctry_Code: ISO2 country code

Worldrank: From 1 to 2000, as ranked in the 2015 Scoreboard

ICB3: Industry sector (ICB-3D), as listed in the 2015 Scoreboard


In [130]:
# drop unnecessary columns
company_lst.drop(["NACE2", "ISIC4_STAN38"], axis=1, inplace=True)

In [141]:
# make column names more understandable 
company_lst = company_lst.rename(columns = {"Company_name":"company", "Ctry_Code":"country", "ICB3":"industry"})

In [142]:
company_lst.head()

Company_id,company,country,Worldrank,industry
1,ZUMTOBEL,AT,851,2730
2,ANDRITZ,AT,892,2750
3,AUSTRIAMICROSYSTEMS,AT,1023,9570
4,AUSTRIA TECHNOLOGIE & SYSTEMTECHNIK,AT,1173,2730
5,VOESTALPINE,AT,596,1750


##### company patents

In [147]:
company_patents = pd.read_csv("../00_data/01_raw/patent_data/2017-COR&DIP_Patent_Portfolio.txt", sep="|", index_col=0)
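
When pandas has to guess the type of every column in a large file, it may emit a `DtypeWarning` for columns with mixed content. A possible way around this is to declare types up front. The sketch below reads made-up rows from a buffer; the column names follow the documentation quoted in this notebook, but their order, the sample values, and the choice of which column holds mixed types are assumptions.

```python
import io

import pandas as pd

# Made-up rows in the pipe-delimited layout (not real patent records)
raw = io.StringIO(
    "Company_id|Patent_appln_id|Patent_publn_nr|Publn_auth|Patent_filing_date\n"
    "1|101|A0001|EP|2013-01-14\n"
    "1|102|0002|EP|2013-01-22\n"
)

patents = pd.read_csv(
    raw,  # replace with the path to the real file
    sep="|",
    index_col=0,
    # Assumption: the publication number mixes digits and letters, so keep it as text
    dtype={"Patent_publn_nr": str, "Publn_auth": "category"},
    parse_dates=["Patent_filing_date"],
)
print(patents.dtypes)
```

Parsing the filing date at read time also avoids a separate conversion step later on.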



In [162]:
company_patents.head()

Company_id,Patent_appln_id,office,date
1,380646999,EP,2013-01-14
1,380718081,EP,2013-01-22
1,380385005,EP,2013-01-22
1,380385015,EP,2013-01-22
1,380889366,EP,2013-02-01


In [151]:
# remove unnecessary columns
lst = ["Patent_publn_nr", "Family_filing_date", "IP5_2_offices", "Inpadoc_family_id"]
company_patents.drop(lst, axis=1, inplace=True)

Company_id: Unique company identifier

Patent_appln_id: Patent application identifier (Appln_id from PATSTAT, Autumn 2016)

Publn_auth: IP5 Offices (EP, JP, KR, US, CN)


Patent_filing_date: Application date

Inpadoc_family_id: Patent family identifier (from PATSTAT, Autumn 2016)

In [152]:
# make column names more understandable
company_patents = company_patents.rename(columns = {"Publn_auth":"office", "Patent_filing_date":"date"})

In [158]:
counts = company_patents.groupby(["Company_id", "office"]).count()

In [160]:
counts.head()

Company_id,office,Patent_appln_id,date
1,CN,84,84
1,EP,263,263
1,JP,1,1
1,US,86,86
2,CN,74,74
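
The grouped counts above repeat the same number in every column. A wider layout with one row per company and one column per office may be easier to compare. A minimal sketch on made-up rows, assuming the renamed columns `office` and `Patent_appln_id` and `Company_id` as the index:

```python
import pandas as pd

# Made-up rows; the application ids are placeholders
patents = pd.DataFrame({
    "Company_id": [1, 1, 1, 2],
    "office": ["EP", "EP", "US", "CN"],
    "Patent_appln_id": [101, 102, 103, 104],
}).set_index("Company_id")

# size() counts rows per group; unstack() moves the offices into columns
per_office = (
    patents.groupby(["Company_id", "office"])
    .size()
    .unstack(fill_value=0)
)
print(per_office)
```

Using `fill_value=0` records a company with no filings at an office as zero rather than as a missing value.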


##### company finance

In [25]:
company_info = pd.read_csv("../00_data/01_raw/patent_data/2017-COR&DIP_Company_financial.txt", sep="|", index_col=0)

In [26]:
company_info.head()

Company_id,Year,RD,NS,CAPEX,OP,EMP
1,2011,55.071,1280.312,57.159,34.591,7456.0
1,2012,66.926,1243.616,59.509,21.659,7162.0
1,2013,71.8,1246.831,65.553,12.144,7291.0
1,2014,89.739,1312.62,76.576,41.091,7234.0
2,2011,65.641,4595.993,76.974,263.445,16750.0


Year: 2011-2014

RD: Research and Development investment (million €)

NS: Net sales (million €)

CAPEX: Capital expenditure (million €)

OP: Operating profits (million €)

EMP: Number of employees
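
Since RD and NS are both expressed in million €, they can be combined into a unit-free ratio. The sketch below computes R&D spending as a share of net sales for three of the rows shown above; the column name `rd_intensity` is my own label and is not part of the dataset.

```python
import pandas as pd

# Three rows copied from the head() output above
finance = pd.DataFrame({
    "Company_id": [1, 1, 2],
    "Year": [2011, 2012, 2011],
    "RD": [55.071, 66.926, 65.641],
    "NS": [1280.312, 1243.616, 4595.993],
}).set_index("Company_id")

# Share of net sales spent on research and development
finance["rd_intensity"] = finance["RD"] / finance["NS"]
print(finance)
```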

### BONUS: Organizations, financing and descriptions (from 2015)

In [93]:
organizations = pd.read_csv("../00_data/01_raw/crunchbase15/companies_15.csv")

In [94]:
organizations.head()

Unnamed: 0,name,category_list,status,country_code,city,founded_at,last_funding_at
0,#fame,Media,operating,IND,Mumbai,,05/01/2015
1,:Qounter,Application Platforms|Real Time|Social Network...,operating,USA,Delaware City,04/09/2014,14/10/2014
2,"(THE) ONE of THEM,Inc.",Apps|Games|Mobile,operating,,,,30/01/2014
3,0-6.com,Curated Web,operating,CHN,Beijing,01/01/2007,19/03/2008
4,004 Technologies,Software,operating,USA,Champaign,01/01/2010,24/07/2014


###### Checking for NaNs

In [95]:
# removing companies with NaNs, no matter what fraction of the data we lose
# we need all attributes but location to be filled for the data to be of use
# (we will be sorting by category and founding date)
organizations.isna().sum()/len(organizations)*100

name                0.001507
category_list       4.743250
status              0.000000
country_code       10.483968
city               12.096191
founded_at         22.934245
last_funding_at     0.000000
dtype: float64

In [96]:
# we will drop all NaNs but those in the location attribute
# we will drop city as we do not need it, country is enough

organizations.drop("city", axis=1, inplace=True)

# create df for location to merge back in after drop nan
location = organizations[["country_code"]]

# drop country from original
organizations.drop("country_code", axis=1, inplace=True)

# drop nans 
organizations.dropna(inplace=True)

# merge countries back in
organizations = organizations.merge(location, how="left", left_index=True, right_index=True)
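
The same result can probably be reached without splitting off and re-merging the country column, since `DataFrame.dropna` accepts a `subset` of columns to check. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows with the same columns as the frame above (city already dropped)
orgs = pd.DataFrame({
    "name": ["a", "b", None, "d"],
    "category_list": ["Software", None, "Media", "Apps"],
    "founded_at": ["01/01/2010", "01/01/2011", "01/01/2012", None],
    "country_code": [None, "USA", "IND", "CHN"],
})

# Require every column except the location to be filled
required = [col for col in orgs.columns if col != "country_code"]
cleaned = orgs.dropna(subset=required)
print(cleaned)
```

Only the first row survives here: it is complete apart from its missing country, which is kept on purpose.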

In [97]:
organizations.isna().sum()

name                  0
category_list         0
status                0
founded_at            0
last_funding_at       0
country_code       3183
dtype: int64

In [107]:
# checking DF attributes
print(organizations.shape)

(49710, 6)


In [99]:
organizations.head()

Unnamed: 0,name,category_list,status,founded_at,last_funding_at,country_code
1,:Qounter,Application Platforms|Real Time|Social Network...,operating,04/09/2014,14/10/2014,USA
3,0-6.com,Curated Web,operating,01/01/2007,19/03/2008,CHN
4,004 Technologies,Software,operating,01/01/2010,24/07/2014,USA
6,Ondine Biomedical Inc.,Biotechnology,operating,01/01/1997,21/12/2009,CAN
7,H2O.ai,Analytics,operating,01/01/2011,09/11/2015,USA


In [112]:
# checking df attributes
organizations["status"].value_counts()
organizations["category_list"].value_counts()

Software               3137
Biotechnology          2431
E-Commerce              999
Mobile                  883
Curated Web             794
Clean Technology        731
Hardware + Software
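
Note that `category_list` can hold several categories separated by `|` (for example `Apps|Games|Mobile`), so the counts above treat each combination as its own value. One way to count individual categories instead is to split and explode the column. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values in the same pipe-separated layout
categories = pd.Series(
    ["Apps|Games|Mobile", "Software", "Software|Mobile"],
    name="category_list",
)

# Split each entry into a list, then give every category its own row
per_category = categories.str.split("|").explode().value_counts()
print(per_category)
```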

In [116]:
organizations.dtypes

name               object
category_list      object
status             object
founded_at         object
last_funding_at    object
country_code       object
dtype: object

In [124]:
organizations[organizations["founded_at"]=="1899-12-31"]

Unnamed: 0,name,category_list,status,founded_at,last_funding_at,country_code
1879,AG&P,Clean Technology,operating,1899-12-31,02/07/2013,PHL
6153,Becker College,Education,operating,1899-12-31,13/09/2013,USA
9898,Carnegie Mellon University,Education,operating,1899-12-31,02/09/2014,USA
60561,University of Chicago,Education,operating,1899-12-31,06/01/2014,USA


In [119]:
pd.to_datetime(organizations["founded_at"], format="%d/%m/%Y")

ValueError: time data '1899-12-31' does not match format '%d/%m/%Y' (match)
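
The error comes from the few rows that store `founded_at` as `1899-12-31` rather than in the `dd/mm/yyyy` layout. That date looks like a placeholder, although this is an assumption on my part. One option is to let pandas turn non-matching values into `NaT` and inspect them afterwards. A minimal sketch on made-up values:

```python
import pandas as pd

# Two regular dates plus the value that does not match the format
founded = pd.Series(["04/09/2014", "01/01/2007", "1899-12-31"])

# errors="coerce" replaces unparseable entries with NaT instead of raising
parsed = pd.to_datetime(founded, format="%d/%m/%Y", errors="coerce")
print(parsed)
```

Rows that end up as `NaT` can then be dropped or reviewed by hand, depending on how many there are.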

##### Organisations extended (up to 2015) (for market research)

## 3. Organising Database


### Description of section: 
- Importing Datasets
- Organising Database
- Creating Database connection and exporting Datasets to cloud


The prework of the project consists mainly of organising the data storage and creating the tables that I will use throughout this analysis.

The goal of this part is to describe the database, provide the credentials needed to access it as a guest, and upload the data to the cloud.

I will be using a number of different datasets for the analysis of the project.

The links and organisation between these can be seen below: 

![DB_diagram](../02_visualisations/Database_org.PNG)

> Note that the tables country and industry are only used as key-holders for extra information. I will not be analysing the information in them directly.

## 3. Exporting Data into cloud

##### Database guest credentials:

<div class="alert alert-block alert-info">
<b>Credentials for DB:</b> 

##### To access the database, see the details below:
Note that this database has an IP restriction, and can therefore only be accessed from the IRONHACK campus. If the DB needs to be accessed from a different IP address, please contact me.

>**Connection information:** <br>
>**User name**: ironhack <br> 
>**Password**: Ironhack1 <br>
>**Host name**: 35.240.116.117 <br>
</div>
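
The export step itself can be sketched with `DataFrame.to_sql`. The database engine is not stated in this notebook, so the example uses an in-memory SQLite database as a stand-in; the table name `company` and the sample rows are assumptions. For the remote database, an SQLAlchemy engine built from the connection details would take the place of `conn`.

```python
import sqlite3

import pandas as pd

# Two rows in the shape of the cleaned company list
company = pd.DataFrame({
    "Company_id": [1, 2],
    "company": ["ZUMTOBEL", "ANDRITZ"],
    "country": ["AT", "AT"],
})

# Stand-in for the remote database: a throwaway in-memory SQLite connection
conn = sqlite3.connect(":memory:")
company.to_sql("company", conn, index=False, if_exists="replace")

# Read the table back to confirm the upload
loaded = pd.read_sql("SELECT * FROM company", conn)
conn.close()
print(loaded)
```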