# **Feature Engineering**

## Objectives

* Take the cleaned dataset, give its columns shorter names, recode the ISIC divisions as the letters A-L, and build a new dataframe with the total value for each industry.

## Inputs

* cleaned_data.csv 

## Outputs

* cleaned_data2.csv
* industry_total.csv




---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory needs to be changed when running a notebook in the editor

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\Capstone\\Capstone\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\Capstone\\Capstone'
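
Running the `os.chdir` cell a second time would move up one more level. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as the path output above suggests). The helper name `project_root` is hypothetical and not part of this notebook:

```python
from pathlib import Path


def project_root(current: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent folder only if `current` is the notebooks folder."""
    # If we are already in the project root, return it unchanged so that
    # running the cell twice does not climb a second level.
    return current.parent if current.name == notebooks_dir else current


# Pure path arithmetic: nothing is changed on disk or in the process here.
nb_dir = Path("Capstone") / "jupyter_notebooks"
root_once = project_root(nb_dir)
root_twice = project_root(root_once)
```

In the notebook this could be used as `os.chdir(project_root(Path.cwd()))`, which is safe to re-run.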

In [4]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Load the cleaned data (path built from the project root set above,
# rather than a hard-coded absolute path)
cleaned_data_path = os.path.join(current_dir, 'cleaned_data.csv')
cleaned_data = pd.read_csv(cleaned_data_path)

# Display the first few rows of the dataframe
print(cleaned_data.head())

  Country ISO3    Country  Year ISIC Division  \
0          ARG  Argentina  2000         10-12   
1          ARG  Argentina  2000         13-15   
2          ARG  Argentina  2000            16   
3          ARG  Argentina  2000         17-18   
4          ARG  Argentina  2000            19   

                                  ISIC Division Desc         Measure  \
0  Manufacture of food products, beverages and to...  Current prices   
1  Manufacture of textiles, wearing apparel, leat...  Current prices   
2  Manufacture of wood and of products of wood an...  Current prices   
3         Manufacture of paper products and printing  Current prices   
4  Manufacture of coke and refined petroleum prod...  Current prices   

   Value LCU (million national currency)  Value USD (million USD)  
0                               16986.10                  16994.6  
1                                5944.13                   5947.1  
2                                1551.72                   1552.5  


---

## Feature engineering

Now let us rename some of the columns to make the analysis easier. We want to change 'Country ISO3' to 'Country Code', 'ISIC Division' to 'Ind Code', and 'ISIC Division Desc' to 'Industry'

In [27]:
# Make a copy of cleaned_data
cleaned_data2 = cleaned_data.copy()

# Rename columns
cleaned_data2.rename(columns={
    'Country ISO3': 'Country Code',
    'ISIC Division': 'Ind Code',
    'ISIC Division Desc': 'Industry'
}, inplace=True)

print(cleaned_data2.head())

  Country Code    Country  Year Ind Code  \
0          ARG  Argentina  2000        A   
1          ARG  Argentina  2000        B   
2          ARG  Argentina  2000        C   
3          ARG  Argentina  2000        D   
4          ARG  Argentina  2000        E   

                                            Industry         Measure  \
0  Manufacture of food products, beverages and to...  Current prices   
1  Manufacture of textiles, wearing apparel, leat...  Current prices   
2  Manufacture of wood and of products of wood an...  Current prices   
3         Manufacture of paper products and printing  Current prices   
4  Manufacture of coke and refined petroleum prod...  Current prices   

   Value LCU (million national currency)  Value USD (million USD)  
0                               16986.10                  16994.6  
1                                5944.13                   5947.1  
2                                1551.72                   1552.5  
3                             

In [28]:
unique_industries = cleaned_data2['Industry'].unique()
print(unique_industries)

['Manufacture of food products, beverages and tobacco products'
 'Manufacture of textiles, wearing apparel, leather and related products'
 'Manufacture of wood and of products of wood and cork, except furniture; manufacture of articles of straw and plaiting materials'
 'Manufacture of paper products and printing'
 'Manufacture of coke and refined petroleum products'
 'Manufacturing of chemicals and pharmaceutical products'
 'Manufacture of rubber and plastics products'
 'Manufacture of other non-metallic mineral products'
 'Manufacture of basic metals'
 'Manufacture of fabricated metal products, machinery and equipment'
 'Manufacture of transport equipment'
 'Manufacture of furniture, other manufacturing']


Looking at the data, there are no obvious short forms or abbreviations I can use to describe each industry, so I will use the letters A-L instead, and the user can refer to a reference list to see what each letter represents
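
The reference list itself can be derived from the data rather than typed by hand. A minimal sketch, assuming the letter codes are already in place; `sample` is a toy stand-in for `cleaned_data2`, and the names `reference` and `code_to_industry` are hypothetical:

```python
import pandas as pd

# Toy rows standing in for cleaned_data2 after the letter codes are applied.
sample = pd.DataFrame({
    "Ind Code": ["A", "D", "A", "E"],
    "Industry": [
        "Manufacture of food products, beverages and tobacco products",
        "Manufacture of paper products and printing",
        "Manufacture of food products, beverages and tobacco products",
        "Manufacture of coke and refined petroleum products",
    ],
})

# One row per letter: drop repeated (code, description) pairs and sort by code.
reference = (
    sample[["Ind Code", "Industry"]]
    .drop_duplicates()
    .sort_values("Ind Code")
    .reset_index(drop=True)
)

# Dictionary form for quick lookups, e.g. when labelling charts.
code_to_industry = dict(zip(reference["Ind Code"], reference["Industry"]))
```

The `reference` dataframe could also be exported alongside the other outputs if a standalone lookup file is useful.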

In [29]:
unique_ind_codes = cleaned_data2['Ind Code'].unique()
print(unique_ind_codes)

['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L']


In [30]:
# Mapping dictionary
ind_code_mapping = {
    '10-12': 'A',
    '13-15': 'B',
    '16': 'C',
    '17-18': 'D',
    '19': 'E',
    '20-21': 'F',
    '22': 'G',
    '23': 'H',
    '24': 'I',
    '25-28': 'J',
    '29-30': 'K',
    '31-32': 'L'
}

# Replace the ISIC division codes in 'Ind Code' with the letters A-L
cleaned_data2['Ind Code'] = cleaned_data2['Ind Code'].replace(ind_code_mapping)

print(cleaned_data2.head())

  Country Code    Country  Year Ind Code  \
0          ARG  Argentina  2000        A   
1          ARG  Argentina  2000        B   
2          ARG  Argentina  2000        C   
3          ARG  Argentina  2000        D   
4          ARG  Argentina  2000        E   

                                            Industry         Measure  \
0  Manufacture of food products, beverages and to...  Current prices   
1  Manufacture of textiles, wearing apparel, leat...  Current prices   
2  Manufacture of wood and of products of wood an...  Current prices   
3         Manufacture of paper products and printing  Current prices   
4  Manufacture of coke and refined petroleum prod...  Current prices   

   Value LCU (million national currency)  Value USD (million USD)  
0                               16986.10                  16994.6  
1                                5944.13                   5947.1  
2                                1551.72                   1552.5  
3                             

In [31]:
unique_ind_codes = cleaned_data2['Ind Code'].unique()
print(unique_ind_codes)

['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L']
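
Because `replace` leaves any value that is not a key in the mapping unchanged, a code missing from the dictionary would pass through silently. A minimal sketch of a check for this, using a subset of the mapping and a toy column; the code `"33"` is made up for illustration:

```python
import pandas as pd

# Subset of the mapping, for illustration only.
ind_code_mapping = {"10-12": "A", "13-15": "B", "16": "C"}

# Toy column; "33" is a made-up code that the mapping does not cover.
codes = pd.Series(["10-12", "16", "33", "13-15"], name="Ind Code")

# Codes that replace() would pass through unchanged.
unmapped = sorted(set(codes) - set(ind_code_mapping))

# map() turns uncovered codes into NaN instead of keeping them.
mapped = codes.map(ind_code_mapping)
n_missing = int(mapped.isna().sum())
```

An empty `unmapped` list before the replacement confirms that every division code has a letter.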


---

We want to make a new dataframe which has the total value for each industry globally

In [32]:
# Group by 'Ind Code' and 'Industry', then sum 'Value LCU (million national currency)' and 'Value USD (million USD)'
industry_total = cleaned_data2.groupby(['Ind Code', 'Industry']).agg({
    'Value LCU (million national currency)': 'sum',
    'Value USD (million USD)': 'sum'
}).reset_index()

# Rename the columns
industry_total.columns = ['Ind Code', 'Industry', 'LCU', 'USD']

print(industry_total.head())

  Ind Code                                           Industry           LCU  \
0        A  Manufacture of food products, beverages and to...  1.155202e+10   
1        B  Manufacture of textiles, wearing apparel, leat...  6.641171e+09   
2        C  Manufacture of wood and of products of wood an...  1.164530e+09   
3        D         Manufacture of paper products and printing  2.177230e+09   
4        E  Manufacture of coke and refined petroleum prod...  4.731789e+09   

           USD  
0  41064493.27  
1  11988448.40  
2   5113218.95  
3  13774759.69  
4  15622600.06  
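
The LCU column holds each country's own currency, so the LCU total above adds different currencies together; only the USD total is comparable across countries. A minimal sketch, on made-up numbers, of keeping the LCU sums within one country while summing USD across all of them:

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    "Country Code": ["ARG", "ARG", "BRA", "BRA"],
    "Ind Code": ["A", "B", "A", "B"],
    "Value LCU (million national currency)": [100.0, 40.0, 300.0, 90.0],
    "Value USD (million USD)": [10.0, 4.0, 60.0, 18.0],
})

# LCU stays within one country, so each sum is in a single currency.
lcu_by_country = (
    toy.groupby(["Country Code", "Ind Code"], as_index=False)
    ["Value LCU (million national currency)"]
    .sum()
)

# USD is a common unit, so it can be summed across countries.
usd_global = toy.groupby("Ind Code")["Value USD (million USD)"].sum()
```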


The year-on-year growth rate for each industry in each country is left for Tableau, where these measures are easier to show.
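
If the growth rate were ever needed in pandas instead, it could be computed along these lines. This is only a sketch on made-up numbers, and it assumes one row per country, industry and year, with no gaps between years:

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    "Country Code": ["ARG"] * 3 + ["BRA"] * 3,
    "Ind Code": ["A"] * 6,
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "Value USD (million USD)": [100.0, 110.0, 121.0, 200.0, 250.0, 200.0],
})

# Sort so that each group's rows are in chronological order.
toy = toy.sort_values(["Country Code", "Ind Code", "Year"])

# Year-on-year growth in percent; the first year of each group is NaN.
toy["Growth %"] = (
    toy.groupby(["Country Code", "Ind Code"])["Value USD (million USD)"]
    .pct_change()
    * 100
)

# Average yearly growth per country and industry (NaN rows are skipped).
avg_growth = toy.groupby(["Country Code", "Ind Code"])["Growth %"].mean()
```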

In [33]:
# Export cleaned_data2 to CSV
cleaned_data2.to_csv('cleaned_data2.csv', index=False)

# Export industry_total to CSV
industry_total.to_csv('industry_total.csv', index=False)