# **Feature Engineering**

## Objectives

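* Take the cleaned dataset and prepare it for analysis: shorten the column names, replace the ISIC division codes with single letters, add the continent of each country, and build a new dataframe with the total value of each industry across all countries. Year-by-year growth rates are left for Tableau.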

## Inputs

* cleaned_data.csv 

## Outputs

* cleaned_data2.csv
* industry_total.csv



---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory has to be changed before the notebook is run in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\Capstone\\Capstone\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\Capstone\\Capstone'
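One caveat with the cells above: running them a second time moves the working directory up one more level. A minimal sketch of a guard against that follows, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed earlier. The helper name `project_root` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `cwd` only if `cwd` is the notebooks subfolder."""
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Illustrative relative paths, used only to show the behaviour
inside_notebooks = Path("Capstone") / "jupyter_notebooks"
already_at_root = Path("Capstone")

print(project_root(inside_notebooks))  # moves up to the parent folder
print(project_root(already_at_root))   # stays where it is on a re-run
```

In the notebook this could be used as `os.chdir(project_root(Path.cwd()))`, which is safe to run more than once.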

In [4]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [5]:
# Load the cleaned data from the project root (current_dir was set in the previous cell)
cleaned_data_path = os.path.join(current_dir, 'cleaned_data.csv')
cleaned_data = pd.read_csv(cleaned_data_path)

# Display the first few rows of the dataframe
print(cleaned_data.head())

  Country ISO3    Country  Year ISIC Division  \
0          ARG  Argentina  2000         10-12   
1          ARG  Argentina  2000         13-15   
2          ARG  Argentina  2000            16   
3          ARG  Argentina  2000         17-18   
4          ARG  Argentina  2000            19   

                                  ISIC Division Desc         Measure  \
0  Manufacture of food products, beverages and to...  Current prices   
1  Manufacture of textiles, wearing apparel, leat...  Current prices   
2  Manufacture of wood and of products of wood an...  Current prices   
3         Manufacture of paper products and printing  Current prices   
4  Manufacture of coke and refined petroleum prod...  Current prices   

   Value LCU (million national currency)  Value USD (million USD)  
0                               16986.10                  16994.6  
1                                5944.13                   5947.1  
2                                1551.72                   1552.5  
3                                4567.81                   4570.1  
4                                2938.13                   2939.6  
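Before going further it can help to confirm that the file contains the columns the rest of the notebook relies on. The sketch below checks for the eight column names shown in the output above. The helper `missing_columns` and the one-row in-memory CSV are illustrative stand-ins, not part of the original notebook or dataset.

```python
import io

import pandas as pd

# Column names taken from the printed head() output
EXPECTED_COLUMNS = [
    'Country ISO3', 'Country', 'Year', 'ISIC Division', 'ISIC Division Desc',
    'Measure', 'Value LCU (million national currency)', 'Value USD (million USD)',
]


def missing_columns(df: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Tiny in-memory stand-in for cleaned_data.csv (description text is a placeholder)
sample_csv = io.StringIO(
    "Country ISO3,Country,Year,ISIC Division,ISIC Division Desc,Measure,"
    "Value LCU (million national currency),Value USD (million USD)\n"
    "ARG,Argentina,2000,16,Sample description,Current prices,1551.72,1552.5\n"
)

# Read the division code as text so that '16' and '10-12' share one dtype
sample = pd.read_csv(sample_csv, dtype={'ISIC Division': str})

print(missing_columns(sample, EXPECTED_COLUMNS))
```

With the real data the same check would be `missing_columns(cleaned_data, EXPECTED_COLUMNS)`, where an empty list means every expected column is present.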


---

## Feature engineering

First, let us shorten some of the column names to make the analysis easier.

In [11]:
# Make a copy of cleaned_data
cleaned_data2 = cleaned_data.copy()

# Rename columns
cleaned_data2.rename(columns={
    'Country ISO3': 'ISO3',
    'ISIC Division': 'Ind Code',
    'ISIC Division Desc': 'Industry',
    'Value LCU (million national currency)': 'Value (LCU)',
    'Value USD (million USD)': 'Value (USD)'
}, inplace=True)

print(cleaned_data2.head())

  ISO3    Country  Year Ind Code  \
0  ARG  Argentina  2000    10-12   
1  ARG  Argentina  2000    13-15   
2  ARG  Argentina  2000       16   
3  ARG  Argentina  2000    17-18   
4  ARG  Argentina  2000       19   

                                            Industry         Measure  \
0  Manufacture of food products, beverages and to...  Current prices   
1  Manufacture of textiles, wearing apparel, leat...  Current prices   
2  Manufacture of wood and of products of wood an...  Current prices   
3         Manufacture of paper products and printing  Current prices   
4  Manufacture of coke and refined petroleum prod...  Current prices   

   Value (LCU)  Value (USD)  
0     16986.10      16994.6  
1      5944.13       5947.1  
2      1551.72       1552.5  
3      4567.81       4570.1  
4      2938.13       2939.6  
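By default, `DataFrame.rename` silently ignores keys that do not match an existing column, so a typo in the mapping would go unnoticed. A small sketch of the difference that `errors='raise'` makes follows. The misspelled key `'Country IS03'` (with a zero) is a deliberate, made-up example.

```python
import pandas as pd

df = pd.DataFrame({'Country ISO3': ['ARG'], 'ISIC Division': ['16']})

# A correct mapping renames both columns
renamed = df.rename(
    columns={'Country ISO3': 'ISO3', 'ISIC Division': 'Ind Code'},
    errors='raise',
)
print(list(renamed.columns))

# A misspelled key is ignored by default, leaving the columns unchanged
unchanged = df.rename(columns={'Country IS03': 'ISO3'})
print(list(unchanged.columns))

# With errors='raise' the same typo is reported as a KeyError
try:
    df.rename(columns={'Country IS03': 'ISO3'}, errors='raise')
    caught = False
except KeyError:
    caught = True
print(caught)
```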


In [12]:
# List the distinct industry descriptions
unique_industries = cleaned_data2['Industry'].unique()
print(unique_industries)

['Manufacture of food products, beverages and tobacco products'
 'Manufacture of textiles, wearing apparel, leather and related products'
 'Manufacture of wood and of products of wood and cork, except furniture; manufacture of articles of straw and plaiting materials'
 'Manufacture of paper products and printing'
 'Manufacture of coke and refined petroleum products'
 'Manufacturing of chemicals and pharmaceutical products'
 'Manufacture of rubber and plastics products'
 'Manufacture of other non-metallic mineral products'
 'Manufacture of basic metals'
 'Manufacture of fabricated metal products, machinery and equipment'
 'Manufacture of transport equipment'
 'Manufacture of furniture, other manufacturing']


The industry descriptions have no obvious short forms or abbreviations, so I will label the industries with the letters A-L instead. The user can check a reference list to see which industry each letter represents.

In [14]:
# List the distinct industry codes before mapping
unique_ind_codes = cleaned_data2['Ind Code'].unique()
print(unique_ind_codes)

['10-12' '13-15' '16' '17-18' '19' '20-21' '22' '23' '24' '25-28' '29-30'
 '31-32']


In [15]:
# Mapping dictionary
ind_code_mapping = {
    '10-12': 'A',
    '13-15': 'B',
    '16': 'C',
    '17-18': 'D',
    '19': 'E',
    '20-21': 'F',
    '22': 'G',
    '23': 'H',
    '24': 'I',
    '25-28': 'J',
    '29-30': 'K',
    '31-32': 'L'
}

# Replace 'Ind Code' values with their letter codes
cleaned_data2['Ind Code'] = cleaned_data2['Ind Code'].replace(ind_code_mapping)

print(cleaned_data2.head())

  ISO3    Country  Year Ind Code  \
0  ARG  Argentina  2000        A   
1  ARG  Argentina  2000        B   
2  ARG  Argentina  2000        C   
3  ARG  Argentina  2000        D   
4  ARG  Argentina  2000        E   

                                            Industry         Measure  \
0  Manufacture of food products, beverages and to...  Current prices   
1  Manufacture of textiles, wearing apparel, leat...  Current prices   
2  Manufacture of wood and of products of wood an...  Current prices   
3         Manufacture of paper products and printing  Current prices   
4  Manufacture of coke and refined petroleum prod...  Current prices   

   Value (LCU)  Value (USD)  
0     16986.10      16994.6  
1      5944.13       5947.1  
2      1551.72       1552.5  
3      4567.81       4570.1  
4      2938.13       2939.6  


In [16]:
# Confirm that every code was replaced by a letter
unique_ind_codes = cleaned_data2['Ind Code'].unique()
print(unique_ind_codes)

['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L']
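The reference list of letters mentioned earlier can be derived from the data itself rather than typed by hand. The sketch below shows one way to do this on a small stand-in frame. The name `industry_reference` and the four demo rows are illustrative only.

```python
import pandas as pd

# Small stand-in with repeated code/industry pairs, as in the full dataset
demo = pd.DataFrame({
    'Ind Code': ['B', 'A', 'A', 'B'],
    'Industry': [
        'Manufacture of textiles, wearing apparel, leather and related products',
        'Manufacture of food products, beverages and tobacco products',
        'Manufacture of food products, beverages and tobacco products',
        'Manufacture of textiles, wearing apparel, leather and related products',
    ],
})

# One row per letter, sorted alphabetically
industry_reference = (
    demo[['Ind Code', 'Industry']]
    .drop_duplicates()
    .sort_values('Ind Code')
    .reset_index(drop=True)
)
print(industry_reference)
```

Applied to `cleaned_data2`, the same chain should give a 12-row lookup table that could be exported next to the other outputs.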


I want to add a new column that indicates which continent each country belongs to. This will be useful later when we want to compare, for example, Europe vs America.

In [17]:
cleaned_data2['Country'].unique()

array(['Argentina', 'Australia', 'Austria', 'Belgium', 'Brazil',
       'Brunei Darussalam', 'Bulgaria', 'Cambodia', 'Canada', 'Chile',
       'Chinese Taipei', 'Colombia', 'Costa Rica', 'Croatia', 'Cyprus',
       'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France',
       'Germany', 'Greece', 'Hong Kong', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Ireland', 'Italy', 'Japan', 'Kazakhstan', 'Korea',
       'Laos', 'Latvia', 'Lithuania', 'Luxembourg', 'Malaysia', 'Malta',
       'Mexico', 'Morocco', 'Myanmar', 'New Zealand', 'North Macedonia',
       'Norway', "People's Republic of China", 'Peru', 'Philippines',
       'Poland', 'Portugal', 'Romania', 'Russian Federation',
       'Saudi Arabia', 'Serbia', 'Singapore', 'Slovak Republic',
       'Slovenia', 'South Africa', 'Spain', 'Sweden', 'Switzerland',
       'Thailand', 'The Netherlands', 'Tunisia', 'Turkey',
       'United Kingdom', 'United States', 'Viet Nam'], dtype=object)

In [18]:
# Mapping dictionary for countries to continents
country_to_continent = {
    'Argentina': 'South America', 
    'Australia': 'Oceania', 
    'Austria': 'Europe', 
    'Belgium': 'Europe', 
    'Brazil': 'South America',
    'Brunei Darussalam': 'Asia', 
    'Bulgaria': 'Europe', 
    'Cambodia': 'Asia', 
    'Canada': 'North America', 
    'Chile': 'South America',
    'Chinese Taipei': 'Asia', 
    'Colombia': 'South America', 
    'Costa Rica': 'North America', 
    'Croatia': 'Europe', 
    'Cyprus': 'Europe',
    'Czech Republic': 'Europe', 
    'Denmark': 'Europe', 
    'Estonia': 'Europe', 
    'Finland': 'Europe', 
    'France': 'Europe',
    'Germany': 'Europe', 
    'Greece': 'Europe', 
    'Hong Kong': 'Asia', 
    'Hungary': 'Europe', 
    'Iceland': 'Europe', 
    'India': 'Asia',
    'Indonesia': 'Asia', 
    'Ireland': 'Europe', 
    'Italy': 'Europe', 
    'Japan': 'Asia', 
    'Kazakhstan': 'Asia',
    'Korea': 'Asia', 
    'Laos': 'Asia', 
    'Latvia': 'Europe', 
    'Lithuania': 'Europe', 
    'Luxembourg': 'Europe', 
    'Malaysia': 'Asia',
    'Malta': 'Europe', 
    'Mexico': 'North America', 
    'Morocco': 'Africa', 
    'Myanmar': 'Asia', 
    'New Zealand': 'Oceania',
    'North Macedonia': 'Europe', 
    'Norway': 'Europe', 
    "People's Republic of China": 'Asia', 
    'Peru': 'South America',
    'Philippines': 'Asia', 
    'Poland': 'Europe', 
    'Portugal': 'Europe', 
    'Romania': 'Europe',
    'Russian Federation': 'Europe', 
    'Saudi Arabia': 'Asia', 
    'Serbia': 'Europe', 
    'Singapore': 'Asia',
    'Slovak Republic': 'Europe', 
    'Slovenia': 'Europe', 
    'South Africa': 'Africa', 
    'Spain': 'Europe', 
    'Sweden': 'Europe',
    'Switzerland': 'Europe', 
    'Thailand': 'Asia', 
    'The Netherlands': 'Europe', 
    'Tunisia': 'Africa', 
    'Turkey': 'Asia',
    'United Kingdom': 'Europe', 
    'United States': 'North America', 
    'Viet Nam': 'Asia'
}

# Add 'Continent' column to cleaned_data2
cleaned_data2['Continent'] = cleaned_data2['Country'].map(country_to_continent)

print(cleaned_data2.head())

  ISO3    Country  Year Ind Code  \
0  ARG  Argentina  2000        A   
1  ARG  Argentina  2000        B   
2  ARG  Argentina  2000        C   
3  ARG  Argentina  2000        D   
4  ARG  Argentina  2000        E   

                                            Industry         Measure  \
0  Manufacture of food products, beverages and to...  Current prices   
1  Manufacture of textiles, wearing apparel, leat...  Current prices   
2  Manufacture of wood and of products of wood an...  Current prices   
3         Manufacture of paper products and printing  Current prices   
4  Manufacture of coke and refined petroleum prod...  Current prices   

   Value (LCU)  Value (USD)      Continent  
0     16986.10      16994.6  South America  
1      5944.13       5947.1  South America  
2      1551.72       1552.5  South America  
3      4567.81       4570.1  South America  
4      2938.13       2939.6  South America  
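`Series.map` returns NaN for any value that is missing from the dictionary, so a country name spelled differently from its dictionary key would end up without a continent. A minimal sketch of a check for this follows. The demo frame and the variant spelling `'Vietnam'` are made up for illustration.

```python
import pandas as pd

# 'Vietnam' is a deliberately mismatched spelling of the key 'Viet Nam'
demo = pd.DataFrame({'Country': ['Argentina', 'Viet Nam', 'Vietnam']})
demo_mapping = {'Argentina': 'South America', 'Viet Nam': 'Asia'}

demo['Continent'] = demo['Country'].map(demo_mapping)

# Countries that received no continent
unmapped = sorted(demo.loc[demo['Continent'].isna(), 'Country'].unique())
print(unmapped)
```

Running the same two lines against `cleaned_data2` should print an empty list if every country in the data has an entry in `country_to_continent`.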


Let's reorder the columns so that they follow a more logical sequence.

In [20]:
# Order of the columns ('Measure' is not listed, so it is dropped from the dataframe)
desired_order = ['Year', 'Country', 'ISO3', 'Continent', 'Ind Code', 'Industry', 'Value (LCU)', 'Value (USD)']

# Reorder the columns
cleaned_data2 = cleaned_data2[desired_order]

# Print the DataFrame to verify the new order
print(cleaned_data2.head())

   Year    Country ISO3      Continent Ind Code  \
0  2000  Argentina  ARG  South America        A   
1  2000  Argentina  ARG  South America        B   
2  2000  Argentina  ARG  South America        C   
3  2000  Argentina  ARG  South America        D   
4  2000  Argentina  ARG  South America        E   

                                            Industry  Value (LCU)  Value (USD)  
0  Manufacture of food products, beverages and to...     16986.10      16994.6  
1  Manufacture of textiles, wearing apparel, leat...      5944.13       5947.1  
2  Manufacture of wood and of products of wood an...      1551.72       1552.5  
3         Manufacture of paper products and printing      4567.81       4570.1  
4  Manufacture of coke and refined petroleum prod...      2938.13       2939.6  
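Selecting columns with a list keeps only the names in that list, so any column left out of `desired_order` is removed from the dataframe. A short sketch of how to see which columns a reorder would drop, or which requested names do not exist, follows. The empty demo frame simply reuses the column names from the output before the reorder.

```python
import pandas as pd

# Empty frame with the columns the dataframe had before the reorder
demo = pd.DataFrame(columns=[
    'ISO3', 'Country', 'Year', 'Ind Code', 'Industry',
    'Measure', 'Value (LCU)', 'Value (USD)', 'Continent',
])
desired_order = ['Year', 'Country', 'ISO3', 'Continent', 'Ind Code',
                 'Industry', 'Value (LCU)', 'Value (USD)']

dropped = [col for col in demo.columns if col not in desired_order]
missing = [col for col in desired_order if col not in demo.columns]

print(dropped)  # columns that the reorder removes
print(missing)  # requested names that do not exist
```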


---

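Next, we build a new dataframe with the total value of each industry across all countries and years. Note that the LCU total adds up values in different national currencies, so only the USD total is comparable across countries.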

In [22]:
# Group by 'Ind Code' and 'Industry' and sum 'Value (LCU)' and 'Value (USD)'
industry_total = cleaned_data2.groupby(['Ind Code', 'Industry']).agg({
    'Value (LCU)': 'sum',
    'Value (USD)': 'sum'
}).reset_index()

# Rename the columns
industry_total.columns = ['Ind Code', 'Industry', 'LCU', 'USD']

print(industry_total.head())

  Ind Code                                           Industry           LCU  \
0        A  Manufacture of food products, beverages and to...  1.155175e+10   
1        B  Manufacture of textiles, wearing apparel, leat...  6.641122e+09   
2        C  Manufacture of wood and of products of wood an...  1.164521e+09   
3        D         Manufacture of paper products and printing  2.177163e+09   
4        E  Manufacture of coke and refined petroleum prod...  4.731489e+09   

           USD  
0  40996209.07  
1  11975995.30  
2   5110883.85  
3  13757468.59  
4  15546318.36  
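As a possible follow-up, the USD totals can be ranked and expressed as a share of the overall total. The sketch below does this on invented numbers. The `Share` column and the values in `demo` are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up values for two industry codes
demo = pd.DataFrame({
    'Ind Code': ['A', 'A', 'B', 'B'],
    'Value (USD)': [30.0, 10.0, 50.0, 10.0],
})

# Sum per industry, then compute each industry's share of the grand total
usd_total = (
    demo.groupby('Ind Code', as_index=False)['Value (USD)']
    .sum()
    .rename(columns={'Value (USD)': 'USD'})
)
usd_total['Share'] = usd_total['USD'] / usd_total['USD'].sum()

# Largest industry first
usd_total = usd_total.sort_values('USD', ascending=False).reset_index(drop=True)
print(usd_total)
```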


The rate of growth year by year for each industry in each country will be left for Tableau, as it's easier to show these measures there.  
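Should the growth rates ever be needed in pandas as well, for instance to cross-check the Tableau figures, they can be computed as (value / previous year's value) - 1 within each country and industry. The following is only a sketch on invented data. The countries `X` and `Y`, the values, and the column name `YoY Growth` are hypothetical, and the calculation assumes one row per consecutive year in each group.

```python
import pandas as pd

# Made-up values for two hypothetical countries and one industry
demo = pd.DataFrame({
    'Country': ['X', 'X', 'X', 'Y', 'Y'],
    'Ind Code': ['A', 'A', 'A', 'A', 'A'],
    'Year': [2001, 2000, 2002, 2000, 2001],
    'Value (USD)': [110.0, 100.0, 121.0, 200.0, 150.0],
})

# Rows must be in year order within each group before shifting
demo = demo.sort_values(['Country', 'Ind Code', 'Year']).reset_index(drop=True)

# Previous year's value within the same country and industry
previous = demo.groupby(['Country', 'Ind Code'])['Value (USD)'].shift(1)
demo['YoY Growth'] = demo['Value (USD)'] / previous - 1

# Average growth rate per country (the first year of each group is NaN and is skipped)
avg_growth = demo.groupby('Country')['YoY Growth'].mean()
print(demo)
print(avg_growth)
```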

In [23]:
# Export cleaned_data2 to CSV
cleaned_data2.to_csv('cleaned_data2.csv', index=False)

# Export industry_total to CSV
industry_total.to_csv('industry_total.csv', index=False)