![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The Pandas Library - Combining Data

*Basic initialization of the workspace.*

In [None]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


In [None]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

#adjust pandas DataFrame display for a wider target 
pd.set_option('display.expand_frame_repr', False)

Pandas installed at version: 1.1.5


Load sample data for processing:

In [None]:
# load science and technology data frame
science_and_technology_data_frame = pd.read_csv(
    "https://github.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/raw/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Science_And_Technology.csv",
 )
print(
   "Loaded science and technology data frame with shape {}".format(
       science_and_technology_data_frame.shape
   ) 
)

Loaded science and technology data frame with shape (415, 6)


In [None]:
# load poverty data frame
poverty_data_frame = pd.read_csv(
    "https://github.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/raw/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Poverty.csv",
 )
print(
   "Loaded poverty data frame with shape {}".format(
       poverty_data_frame.shape
   ) 
)

Loaded poverty data frame with shape (262, 6)


In [None]:
# load education data frame
education_data_frame = pd.read_csv(
    "https://github.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/raw/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Education.csv",
 )
print(
   "Loaded education data frame with shape {}".format(
       education_data_frame.shape
   )
)

Loaded education data frame with shape (13485, 6)


## 1. Data Preparation

The loaded sample data frames store indicators and their values in rows, one row per indicator and year. We can create dedicated data frames that hold the values of a given indicator in columns.

### 1.1 Adding additional columns to a data frame

A column can be added to a data frame by referring to the new column by name and assigning its data:

In [None]:
# adding encoding for science and technology data frame
science_and_technology_data_frame["Topic"] = "Science&Technology"

print(
    "Sample encoded records for science and technology data \n{}".format(
      science_and_technology_data_frame.iloc[0:10]
    )
  )

Sample encoded records for science and technology data 
  Country Name Country ISO3  Year                                     Indicator Name  Indicator Code        Value               Topic
0      Romania          ROU  2020  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  886842442.5  Science&Technology
1      Romania          ROU  2019  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  936735170.4  Science&Technology
2      Romania          ROU  2018  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  962384646.5  Science&Technology
3      Romania          ROU  2017  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  911037524.1  Science&Technology
4      Romania          ROU  2016  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  832020078.0  Science&Technology
5      Romania          ROU  2015  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  834447967.1  Science&Technology
6     

In [None]:
# adding encoding for poverty data frame
poverty_data_frame["Topic"] = "Poverty"

print(
    "Sample encoded records for poverty data \n{}".format(
      poverty_data_frame.iloc[0:10]
    )
  )

Sample encoded records for poverty data 
  Country Name Country ISO3  Year                                     Indicator Name     Indicator Code  Value    Topic
0      Romania          ROU  2018  Population living in slums (% of urban populat...  EN.POP.SLUM.UR.ZS   12.1  Poverty
1      Romania          ROU  2016  Population living in slums (% of urban populat...  EN.POP.SLUM.UR.ZS   14.4  Poverty
2      Romania          ROU  2018                    Income share held by second 20%     SI.DST.02ND.20   11.9  Poverty
3      Romania          ROU  2017                    Income share held by second 20%     SI.DST.02ND.20   11.7  Poverty
4      Romania          ROU  2016                    Income share held by second 20%     SI.DST.02ND.20   12.1  Poverty
5      Romania          ROU  2015                    Income share held by second 20%     SI.DST.02ND.20   11.8  Poverty
6      Romania          ROU  2014                    Income share held by second 20%     SI.DST.02ND.20   11.9  Poverty

In [None]:
# adding encoding for education data frame
education_data_frame["Topic"] = "Education"

print(
    "Sample encoded records for education data \n{}".format(
      education_data_frame.iloc[0:10]
    )
  )

Sample encoded records for education data 
  Country Name Country ISO3  Year                                     Indicator Name       Indicator Code  Value      Topic
0      Romania          ROU  2010  Barro-Lee: Percentage of female population age...  BAR.NOED.1519.FE.ZS   0.88  Education
1      Romania          ROU  2005  Barro-Lee: Percentage of female population age...  BAR.NOED.1519.FE.ZS   0.63  Education
2      Romania          ROU  2000  Barro-Lee: Percentage of female population age...  BAR.NOED.1519.FE.ZS   2.24  Education
3      Romania          ROU  1995  Barro-Lee: Percentage of female population age...  BAR.NOED.1519.FE.ZS   6.02  Education
4      Romania          ROU  1990  Barro-Lee: Percentage of female population age...  BAR.NOED.1519.FE.ZS   1.30  Education
5      Romania          ROU  2010  Barro-Lee: Percentage of population age 15-19 ...     BAR.NOED.1519.ZS   0.97  Education
6      Romania          ROU  2005  Barro-Lee: Percentage of population age 15-19 ...     
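
The assignment pattern used above can be sketched on a small toy data frame. The column names and values below are made up for illustration and are not part of the course data. The sketch shows that a scalar is broadcast to every row, while a list must supply one value per row:

```python
import pandas as pd

# Toy data standing in for the course data frames (values are made up).
toy_frame = pd.DataFrame({
    "Year": [2019, 2020],
    "Value": [10.0, 12.5],
})

# A scalar is broadcast to every row of the new column.
toy_frame["Topic"] = "Example"

# A list (or Series) must provide one value per row.
toy_frame["Source"] = ["survey", "estimate"]

# A new column can also be computed from existing columns.
toy_frame["Value x2"] = toy_frame["Value"] * 2

print(toy_frame)
```

Assigning to an existing column name overwrites that column instead of adding a new one.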

### 1.2 Concatenating data frames

We would like to consolidate all the data in the loaded data frames into a single data frame, used as a single source for data processing. This means stacking the rows of all the data frames into one data frame.

This can be done via the [**concat**](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) function from Pandas, with the axis set to 0 (rows), which is also the default:

In [None]:
# consolidate the data in a single data frame
# we made sure that all the source data frames 
# have the same structure
consolidated_data_frame = pd.concat(
    [science_and_technology_data_frame, poverty_data_frame, education_data_frame],
    axis = 0 
)

print(
    "The consolidated data frame has the shape {}".format(
        consolidated_data_frame.shape
    )
  )

# ensure we have all topics in the consolidated data frame
consolidated_topics = set(consolidated_data_frame["Topic"])

print(
    "The consolidated data frame has the topics {}".format(
        consolidated_topics
    )
  )

The consolidated data frame has the shape (14162, 7)
The consolidated data frame has the topics {'Science&Technology', 'Poverty', 'Education'}
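
One detail worth keeping in mind is that row-wise concatenation keeps the original index labels of each source data frame, so the result can contain repeated labels. The following minimal sketch, on made-up toy frames, compares the default behavior with the `ignore_index` parameter of `pd.concat`, which builds a fresh index:

```python
import pandas as pd

# Two toy data frames with the same structure (values are made up).
first_frame = pd.DataFrame({"Year": [2019, 2020], "Topic": "A"})
second_frame = pd.DataFrame({"Year": [2018, 2019, 2020], "Topic": "B"})

# Default: each source keeps its own index labels.
kept_index = pd.concat([first_frame, second_frame], axis=0)

# ignore_index=True: the result gets a new 0..n-1 index.
fresh_index = pd.concat([first_frame, second_frame], axis=0, ignore_index=True)

print(kept_index.index.tolist())   # [0, 1, 0, 1, 2]
print(fresh_index.index.tolist())  # [0, 1, 2, 3, 4]
```

Repeated labels are harmless for the filtering done in this session, but they matter when rows are later selected by label.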


Suppose we would like to explore the connection between the number of technicians and the number of researchers within the science and technology topic. For this, we create dedicated data frames that hold each indicator in its own column:

In [None]:
# extract researchers information
researchers_data = consolidated_data_frame[consolidated_data_frame["Indicator Name"] == "Researchers in R&D (per million people)"]

researchers_data_frame = pd.DataFrame(
    data = {
        "Year" : pd.Series(data = researchers_data["Year"].values, dtype=np.dtype("i8")),
        # "f8" is a 64-bit float; "f16" (128-bit) is not supported on every platform
        "Researchers per mil. people": pd.Series(data = researchers_data["Value"].values, dtype=np.dtype("f8"))
      } 
   )

print(
    "Extracted researchers data with shape {}".format(
        researchers_data_frame.shape
    )
  )

Extracted researchers data with shape (23, 2)


In [None]:
# extract technicians information
technicians_data = consolidated_data_frame[consolidated_data_frame["Indicator Name"] == "Technicians in R&D (per million people)"]

technicians_data_frame = pd.DataFrame(
    data = {
        "Year" : pd.Series(data = technicians_data["Year"].values, dtype=np.dtype("i8")),
        # "f8" is a 64-bit float; "f16" (128-bit) is not supported on every platform
        "Technicians per mil. people": pd.Series(data = technicians_data["Value"].values, dtype=np.dtype("f8"))
      } 
   )

print(
    "Extracted technicians data frame with shape {}".format(
        technicians_data_frame.shape
    )
  )

Extracted technicians data frame with shape (22, 2)
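
The extraction pattern above (filter the rows of one indicator, then store its values under a descriptive column name) can also be sketched with a boolean mask, `loc` and `rename`. This is only an alternative sketch on made-up toy data; the indicator names and values are assumptions for illustration:

```python
import pandas as pd

# Toy long-format data (made-up values): one row per indicator and year.
long_frame = pd.DataFrame({
    "Year": [2019, 2020, 2019, 2020],
    "Indicator Name": ["Indicator A", "Indicator A", "Indicator B", "Indicator B"],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# Boolean mask that is True for the rows of the indicator of interest.
is_indicator_a = long_frame["Indicator Name"] == "Indicator A"

# Keep only the needed columns, give the value column a descriptive
# name and rebuild the index so that it starts from 0.
indicator_a_frame = (
    long_frame.loc[is_indicator_a, ["Year", "Value"]]
    .rename(columns={"Value": "Indicator A value"})
    .reset_index(drop=True)
)

print(indicator_a_frame)
```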


We would like to create a data frame that contains the information about both researchers and technicians.

This can be done with the [**merge**](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) function, which matches the records on a specified key (by default, the columns that the two data frames have in common). The join type determines which records are retained:

*  **inner** - keeps only the records whose key is present in both data frames (this is the default);
*  **outer** - keeps all the records from both data frames;
*  **left** - keeps all the records from the left hand operand and drops the records from the right hand operand that have no match;
* **right** - keeps all the records from the right hand operand and drops the records from the left hand operand that have no match.
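
The four join types can be compared on two small toy data frames. The frames and their values are made up for illustration; only `pd.merge` with its `on` and `how` parameters is taken from the text above:

```python
import pandas as pd

# Toy frames (made-up values): 2018 exists only on the left,
# 2021 exists only on the right.
left_frame = pd.DataFrame({"Year": [2018, 2019, 2020], "Left value": [1, 2, 3]})
right_frame = pd.DataFrame({"Year": [2019, 2020, 2021], "Right value": [20, 30, 40]})

# Collect the years retained by each join type.
joined_years = {}
for join_type in ["inner", "outer", "left", "right"]:
    joined = pd.merge(left_frame, right_frame, on="Year", how=join_type)
    joined_years[join_type] = sorted(joined["Year"].tolist())
    print(join_type, joined_years[join_type])

# inner [2019, 2020]
# outer [2018, 2019, 2020, 2021]
# left [2018, 2019, 2020]
# right [2019, 2020, 2021]
```

In the outer, left and right joins, the columns coming from the frame with no matching record are filled with NaN.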

With the researchers and technicians data frames, it can be used as follows:

In [None]:
# perform an outer join between researchers and technicians data frames
education_outer_join = pd.merge(
    researchers_data_frame, #left hand operand
    technicians_data_frame, #right hand operand
    on = "Year", #join key
    how = "outer" #join method
  )

print(
    "The education data in outer join mode is \n{}\n".format(
      education_outer_join  
    )
)

# years that have no matching technicians record
# get the value NaN (not a number) in the technicians column
# we would like to find out where data is missing
no_data_records = education_outer_join[
  np.isnan(education_outer_join["Technicians per mil. people"])
]  
 
print(
    "The records with no information are \n {}".format(
      no_data_records
    )
)

The education data in outer join mode is 
    Year  Researchers per mil. people  Technicians per mil. people
0   2018                    882.44127                          NaN
1   2017                    891.32124                    278.82409
2   2016                    911.58518                    275.35469
3   2015                    876.22819                    267.40041
4   2014                    903.82628                    225.09562
5   2013                    922.67455                    248.00355
6   2012                    890.67001                    251.88520
7   2011                    790.68805                    252.35144
8   2010                    966.20415                    153.33240
9   2009                    933.76337                    193.38123
10  2008                    931.08256                    221.80063
11  2007                    894.16331                    207.32912
12  2006                    895.76749                    211.73285
13  2005            
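
Besides testing for NaN, the unmatched records of an outer join can be located with the `indicator` parameter of `pd.merge`, which adds a `_merge` column stating where each record comes from. The sketch below uses made-up toy frames, not the course data:

```python
import pandas as pd

# Toy frames (made-up values): 2018 has no record on the right.
left_frame = pd.DataFrame({"Year": [2018, 2019, 2020], "Left value": [1.0, 2.0, 3.0]})
right_frame = pd.DataFrame({"Year": [2019, 2020], "Right value": [20.0, 30.0]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both".
outer_join = pd.merge(
    left_frame, right_frame, on="Year", how="outer", indicator=True
)

# Records that were found in only one of the two frames.
unmatched = outer_join[outer_join["_merge"] != "both"]
print(unmatched)

# The same records, found by testing for missing values instead.
missing_right = outer_join[outer_join["Right value"].isna()]
```

The two approaches agree here, but they can differ when the source data already contains NaN values: `_merge` reports only whether the key was matched.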

To retain only the records that are present in both data frames, we use the **inner** join type (even if this leads to loss of information):

In [None]:
# perform an inner join between researchers and technicians data frames
education_inner_join = pd.merge(
    researchers_data_frame, #left hand operand
    technicians_data_frame, #right hand operand
    on = "Year", #join key
    how = "inner" #join method
  )

print(
    "The education data in inner join mode is \n{}".format(
      education_inner_join  
    )
)

The education data in inner join mode is 
    Year  Researchers per mil. people  Technicians per mil. people
0   2017                    891.32124                    278.82409
1   2016                    911.58518                    275.35469
2   2015                    876.22819                    267.40041
3   2014                    903.82628                    225.09562
4   2013                    922.67455                    248.00355
5   2012                    890.67001                    251.88520
6   2011                    790.68805                    252.35144
7   2010                    966.20415                    153.33240
8   2009                    933.76337                    193.38123
9   2008                    931.08256                    221.80063
10  2007                    894.16331                    207.32912
11  2006                    895.76749                    211.73285
12  2005                   1071.93762                    233.36285
13  2004            

Now that we have made sure there is no missing data, we can enrich the data by adding the ratio of technicians to researchers:

In [None]:
# add the ratio of technicians to researchers as a new column
education_inner_join["Technicians to Researchers Ratio"] = (
    education_inner_join["Technicians per mil. people"]
    / education_inner_join["Researchers per mil. people"]
)

print(
    "Education enhanced data sample is \n{}".format(
      education_inner_join.iloc[0:10]  
    )
)

Education enhanced data sample is 
   Year  Researchers per mil. people  Technicians per mil. people  Technicians to Researchers Ratio
0  2017                    891.32124                    278.82409                          0.312821
1  2016                    911.58518                    275.35469                          0.302061
2  2015                    876.22819                    267.40041                          0.305172
3  2014                    903.82628                    225.09562                          0.249047
4  2013                    922.67455                    248.00355                          0.268788
5  2012                    890.67001                    251.88520                          0.282804
6  2011                    790.68805                    252.35144                          0.319154
7  2010                    966.20415                    153.33240                          0.158696
8  2009                    933.76337                    193.38123