# **Initial EDA**

## Objectives

* Perform basic EDA using the cleaned and feature-engineered datasets

## Inputs

* df_eng_final.csv
* industry_reference.csv
* gdp_final.csv

## Additional Comments




---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\Capstone\\Capstone\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\Capstone\\Capstone'
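Note that the cells above move up one level every time they are re-run, so running them twice would leave the kernel outside the project folder. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as the path output above suggests). The helper name `project_root` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd only when cwd is the notebooks folder.

    If cwd is already somewhere else (for example the project root),
    it is returned unchanged, so calling this twice is harmless.
    """
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Demonstration on example paths (no directory is actually changed here)
print(project_root(Path("Capstone/jupyter_notebooks")))
print(project_root(Path("Capstone")))
```

In the notebook this could be used as `os.chdir(project_root(Path.cwd()))` in place of the unconditional `os.chdir(os.path.dirname(current_dir))`.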

In [4]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [7]:
# Load the datasets
df_eng_final = pd.read_csv(os.path.join(current_dir, 'df_eng_final.csv'))
industry_ref = pd.read_csv(os.path.join(current_dir, 'industry_reference.csv'))
gdp_final = pd.read_csv(os.path.join(current_dir, 'gdp_final.csv'))


# Display the first few rows of each dataframe to confirm they are loaded correctly
print(df_eng_final.head(2))
print('----------------------------------------------------------------------------------')
print(industry_ref.head(2))
print('----------------------------------------------------------------------------------')
print(gdp_final.head(2))

   Year    Country ISO3 Continent Ind Code  \
0  2000  Australia  AUS   Oceania        A   
1  2000  Australia  AUS   Oceania        B   

                                            Industry  Value (LCU)  \
0  Manufacture of food products, beverages and to...     19295.29   
1  Manufacture of textiles, wearing apparel, leat...      2967.56   

   Value (USD)  Exchange Rate  Volatility  
0      11186.8       0.579768    0.143969  
1       1720.5       0.579769    0.143969  
----------------------------------------------------------------------------------
  Ind Code                                           Industry
0        A  Manufacture of food products, beverages and to...
1        B  Manufacture of textiles, wearing apparel, leat...
----------------------------------------------------------------------------------
     Country ISO3  Year       GDP
0  Australia  AUS  2000  416168.0
1  Australia  AUS  2001  379629.0


---

# EDA

Let's start by looking at how industry value is distributed globally. We'll use the data from 2018, as this is the latest year in which most of the main countries are available.
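One way to check which year has the best country coverage is to count the distinct countries reporting in each year. The sketch below uses a small made-up dataframe (`toy`) with the same column names as `df_eng_final`; the values are illustrative only and do not come from the real dataset.

```python
import pandas as pd

# Toy stand-in for df_eng_final (made-up rows, same column names)
toy = pd.DataFrame({
    "Year": [2017, 2017, 2018, 2018, 2018, 2019],
    "Country": ["P", "Q", "P", "Q", "R", "Q"],
    "Value (USD)": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Number of distinct countries with at least one row in each year
coverage = toy.groupby("Year")["Country"].nunique()
print(coverage)

# Year with the widest coverage in the toy data
best_year = coverage.idxmax()
print(best_year)
```

Running the same `groupby` on `df_eng_final` would show how coverage changes from year to year in the real data.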

In [9]:
# Filter the data for the year 2018
df_eng_final_2018 = df_eng_final[df_eng_final['Year'] == 2018]

# Create the sunburst chart
fig = px.sunburst(data_frame=df_eng_final_2018, path=['Continent', 'Country'], values='Value (USD)',
                  color='Value (USD)', hover_name='ISO3')
fig.show()

In [11]:
df_eng_final_2018['Country'].unique()

array(['Australia', 'Austria', 'Belgium', 'Brazil', 'Brunei Darussalam',
       'Bulgaria', 'Cambodia', 'Canada', 'Chile', 'China',
       'Chinese Taipei', 'Colombia', 'Costa Rica', 'Croatia', 'Cyprus',
       'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France',
       'Germany', 'Greece', 'Hong Kong', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Ireland', 'Italy', 'Japan', 'Kazakhstan', 'Korea',
       'Laos', 'Malaysia', 'Malta', 'Mexico', 'Morocco', 'Myanmar',
       'Netherlands', 'New Zealand', 'North Macedonia', 'Norway', 'Peru',
       'Philippines', 'Poland', 'Portugal', 'Romania',
       'Russian Federation', 'Saudi Arabia', 'Serbia', 'Singapore',
       'Slovenia', 'South Africa', 'Spain', 'Sweden', 'Thailand',
       'Tunisia', 'United States', 'Viet Nam'], dtype=object)

The sunburst chart above provides a hierarchical view of how industrial value was distributed globally in 2018. The chart is structured with continents at the innermost level, followed by countries. The size of each segment represents the value in USD, with larger segments indicating higher values.

Key observations from the chart:
1. **China**: Dominates the chart with the highest values across multiple industries, particularly in the manufacture of fabricated metal products, machinery, and equipment.
2. **United States**: Also shows significant values, especially in the manufacture of fabricated metal products, machinery, and equipment, as well as the manufacture of transport equipment.
3. **Other Countries**: Countries like Japan, India, and South Korea also contribute significantly but are overshadowed by the top two countries. Germany also shows a very healthy contribution and is by far the biggest contributor in Europe.

The chart highlights the concentration of industrial value in a few key countries, with China leading by a substantial margin. This concentration suggests that these countries play a crucial role in the global industrial landscape, particularly in the manufacturing sectors.
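The concentration described above can also be put into numbers by computing each country's share of the total and the cumulative share of the largest countries. This is a sketch on made-up values (`toy_2018`, with placeholder countries P, Q and R), not the notebook's real figures.

```python
import pandas as pd

# Toy stand-in for df_eng_final_2018 (made-up values)
toy_2018 = pd.DataFrame({
    "Country": ["P", "P", "Q", "R"],
    "Ind Code": ["A", "J", "J", "A"],
    "Value (USD)": [10.0, 50.0, 30.0, 10.0],
})

# Total value per country, largest first
totals = (toy_2018.groupby("Country")["Value (USD)"].sum()
          .sort_values(ascending=False))

# Share of the overall total, and running total of those shares
share = totals / totals.sum()
cumulative_share = share.cumsum()

print(pd.DataFrame({"share": share, "cumulative": cumulative_share}))
```

Applied to `df_eng_final_2018`, the `cumulative` column would show how much of the total value the first few countries account for.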

In [15]:
# Aggregate data by industry and country
aggregated_data = df_eng_final_2018.groupby(['Ind Code', 'ISO3'])['Value (USD)'].sum().reset_index()

# Create the bar chart
fig = px.bar(aggregated_data, x='Ind Code', y='Value (USD)', color='ISO3', title='Value (USD) for Each Industry in 2018 by Country')


# Show the plot
fig.show()

In [16]:
industry_ref

Unnamed: 0,Ind Code,Industry
0,A,"Manufacture of food products, beverages and to..."
1,B,"Manufacture of textiles, wearing apparel, leat..."
2,C,Manufacture of wood and of products of wood an...
3,D,Manufacture of paper products and printing
4,E,Manufacture of coke and refined petroleum prod...
5,F,Manufacturing of chemicals and pharmaceutical ...
6,G,Manufacture of rubber and plastics products
7,H,Manufacture of other non-metallic mineral prod...
8,I,Manufacture of basic metals
9,J,"Manufacture of fabricated metal products, mach..."


The bar chart supports our findings in the sunburst chart that most of the value in each specific industry is coming from the countries with the most value overall. Using this information, we shall filter our dataset to only include the top 15 countries.
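To back up the reading of the bar chart with a number, one could compute, for each industry, the fraction of its value that comes from the top countries overall. The sketch below does this on made-up data with `n_top = 2`; the variable names and values are illustrative assumptions, and in the notebook `n_top` would be 15.

```python
import pandas as pd

# Toy stand-in for df_eng_final_2018 (made-up values)
toy_2018 = pd.DataFrame({
    "Country": ["P", "P", "Q", "Q", "R", "R"],
    "Ind Code": ["A", "J", "A", "J", "A", "J"],
    "Value (USD)": [40.0, 60.0, 30.0, 30.0, 30.0, 10.0],
})
n_top = 2  # hypothetical cut-off for this toy example

# Countries with the largest total value across all industries
country_totals = toy_2018.groupby("Country")["Value (USD)"].sum()
top_countries = country_totals.nlargest(n_top).index

# Value per industry: all countries vs. only the top countries
industry_total = toy_2018.groupby("Ind Code")["Value (USD)"].sum()
industry_top = (toy_2018[toy_2018["Country"].isin(top_countries)]
                .groupby("Ind Code")["Value (USD)"].sum())

# Fraction of each industry's value held by the top countries
top_share_by_industry = industry_top / industry_total
print(top_share_by_industry)
```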

---

## Top 15 countries

Let's check which countries are in the top 15.

In [18]:
# Aggregate the data by country to get the total USD value for each country
country_aggregated = df_eng_final_2018.groupby('Country')['Value (USD)'].sum().reset_index()

# Sort the countries by the total USD value in descending order and select the top 15
top_15_countries = country_aggregated.sort_values(by='Value (USD)', ascending=False).head(15)['Country']

# Filter df_eng_final_2018 to include only the top 15 countries
df_15 = df_eng_final_2018[df_eng_final_2018['Country'].isin(top_15_countries)]

# Display the filtered dataframe
df_15

Unnamed: 0,Year,Country,ISO3,Continent,Ind Code,Industry,Value (LCU),Value (USD),Exchange Rate,Volatility
945,2018,Brazil,BRA,South America,A,"Manufacture of food products, beverages and to...",177448.03,48565.00,0.273686,1.017185e-01
946,2018,Brazil,BRA,South America,B,"Manufacture of textiles, wearing apparel, leat...",67882.23,18578.40,0.273686,1.017185e-01
947,2018,Brazil,BRA,South America,C,Manufacture of wood and of products of wood an...,12896.18,3529.50,0.273686,1.017185e-01
948,2018,Brazil,BRA,South America,D,Manufacture of paper products and printing,47377.69,12966.60,0.273686,1.017185e-01
949,2018,Brazil,BRA,South America,E,Manufacture of coke and refined petroleum prod...,102434.63,28034.90,0.273686,1.017185e-01
...,...,...,...,...,...,...,...,...,...,...
13478,2018,United States,USA,North America,H,Manufacture of other non-metallic mineral prod...,65185.24,65185.24,1.000000,4.330824e-08
13479,2018,United States,USA,North America,I,Manufacture of basic metals,80279.76,80279.76,1.000000,4.330824e-08
13480,2018,United States,USA,North America,J,"Manufacture of fabricated metal products, mach...",685707.06,685707.06,1.000000,4.330824e-08
13481,2018,United States,USA,North America,K,Manufacture of transport equipment,308436.50,308436.47,1.000000,4.330824e-08


In [19]:
# Check the countries in the filtered dataframe
df_15['Country'].unique()

array(['Brazil', 'Canada', 'China', 'Chinese Taipei', 'France', 'Germany',
       'India', 'Indonesia', 'Italy', 'Japan', 'Korea', 'Mexico',
       'Russian Federation', 'Spain', 'United States'], dtype=object)

Before we move on, let us check whether we have the latest information on the countries above. If not then we shall proceed with the 2018 data and add a disclaimer in our evaluation that the data is not the very latest.

In [21]:
def print_top_15_countries(df, years):
    for year in years:
        # Filter the data for the given year
        df_year = df[df['Year'] == year]
        
        # Aggregate the data by country to get the total USD value for each country
        country_aggregated = df_year.groupby('Country')['Value (USD)'].sum().reset_index()
        
        # Sort the countries by the total USD value in descending order and select the top 15
        top_15_countries = country_aggregated.sort_values(by='Value (USD)', ascending=False).head(15)['Country'].values

        # Print the top 15 countries for the given year
        print(f"Top 15 countries for {year}: {top_15_countries}")
        print('----------------------------------------------------------------')

# Example usage
years = [2018, 2019, 2020, 2021]  # Replace with the years you want to analyze
print_top_15_countries(df_eng_final, years)


Top 15 countries for 2018: ['China' 'United States' 'Japan' 'Germany' 'Korea' 'India' 'Italy'
 'France' 'Brazil' 'Mexico' 'Russian Federation' 'Indonesia'
 'Chinese Taipei' 'Canada' 'Spain']
----------------------------------------------------------------
Top 15 countries for 2019: ['United States' 'Japan' 'Germany' 'Italy' 'France' 'Mexico'
 'Russian Federation' 'Spain' 'Netherlands' 'Poland' 'Austria' 'Belgium'
 'Sweden' 'Czech Republic' 'Denmark']
----------------------------------------------------------------
Top 15 countries for 2020: ['United States' 'Japan' 'Germany' 'Italy' 'France' 'Russian Federation'
 'Mexico' 'Netherlands' 'Austria' 'Belgium' 'Czech Republic' 'Denmark'
 'Romania' 'Finland' 'Hungary']
----------------------------------------------------------------
Top 15 countries for 2021: ['United States' 'France' 'Italy' 'Austria' 'Czech Republic' 'Denmark'
 'Finland' 'Belgium' 'Hungary' 'Ireland' 'Greece' 'Slovenia' 'Estonia'
 'Malta']
---------------------------------

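As you can see, we don't have the latest data for China, Japan and Germany: China does not appear in any list after 2018, and Japan and Germany are missing from the 2021 list. Since these countries are essential to our analysis, we shall move forward with the 2018 data.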
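A more direct check than comparing top-15 lists is to look at which years each key country actually has rows for. The sketch below uses `pd.crosstab` on a made-up dataframe; the countries P and Q and their years are placeholders, not the real coverage.

```python
import pandas as pd

# Toy stand-in for df_eng_final (made-up rows)
toy = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2020],
    "Country": ["P", "Q", "Q", "Q"],
})
key_countries = ["P", "Q"]  # in the notebook: the countries of interest

# True where a country has at least one row in a given year
subset = toy[toy["Country"].isin(key_countries)]
presence = pd.crosstab(subset["Country"], subset["Year"]) > 0
print(presence)

# Latest year available for each country
last_year = subset.groupby("Country")["Year"].max()
print(last_year)
```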

---

# Further EDA

Now that we have our countries, we want to analyse the industries themselves to see how they differ between countries and which ones have the highest value.

In [None]:
# Sunburst chart for the top 15 countries in 2018 with industry breakdown
fig = px.sunburst(data_frame=df_15, path=['Country', 'Ind Code'], values='Value (USD)', color='Value (USD)', hover_name='Ind Code')
fig.show()

In [None]:
# Treemap for the top 15 countries in 2018 with industry breakdown
fig = px.treemap(data_frame=df_15, path=['Country', 'Ind Code'], values='Value (USD)', color='Value (USD)', hover_name='Ind Code')
fig.show()

In [None]:
industry_ref

As we can clearly see, the Manufacture of fabricated metal products, machinery and equipment (J) has the highest value in most of the top countries by a significant margin. The other top industries in general order are:

- Manufacture of transport equipment (K)
- Manufacture of food products, beverages and tobacco products (A)
- Manufacturing of chemicals and pharmaceutical products (F)
- Manufacture of basic metals (I)
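The ordering above is read off the charts. A small aggregation can confirm it numerically: total value per industry, plus the largest industry within each country. This is a sketch on made-up values (`toy_15` stands in for `df_15`), so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_15 (made-up values, placeholder countries)
toy_15 = pd.DataFrame({
    "Country": ["P", "P", "P", "Q", "Q", "Q"],
    "Ind Code": ["A", "J", "K", "A", "J", "K"],
    "Value (USD)": [20.0, 70.0, 30.0, 15.0, 50.0, 25.0],
})

# Industries ranked by total value across all countries
industry_rank = (toy_15.groupby("Ind Code")["Value (USD)"].sum()
                 .sort_values(ascending=False))
print(industry_rank)

# Largest industry within each country (row with the maximum value)
top_rows = toy_15.groupby("Country")["Value (USD)"].idxmax()
top_industry_per_country = toy_15.loc[top_rows, ["Country", "Ind Code"]]
print(top_industry_per_country)
```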

Let's have a look at these five and see how they have developed over the years.

In [None]:
# Filter the data for the required Ind Codes
# Note: df_eng_curr is not defined in this notebook; df_eng_final (loaded above)
# is assumed to be the intended dataframe
ind_codes = ['J', 'K', 'A', 'F', 'I']
filtered_df = df_eng_final[df_eng_final['Ind Code'].isin(ind_codes)]

# Create a time series plot for each Ind Code using plotly
for ind_code in ind_codes:
    # Rows for this industry only, reused for the title and the plot
    industry_df = filtered_df[filtered_df['Ind Code'] == ind_code]
    industry = industry_df['Industry'].unique()[0]
    fig = px.line(industry_df, x='Year', y='Value (USD)', color='Country',
                  title=f'Time Series of Value (million USD) for the {industry} industry')
    fig.show()


Looking at the graphs, we can see that China has grown significantly and consistently since 2000 in the five industries we've filtered, to the point where it is the market leader in all five by a wide margin, even overtaking the USA. The manufacture of fabricated metal products, machinery and equipment is by far its strongest industry. Other than the USA, which has shown some growth across the industries, the other countries have shown comparatively little to no growth and have remained stagnant. Countries like Japan have actually declined in some industries over the years, such as the manufacture of transport equipment.
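The growth comparison above can be quantified as the relative change between the first and last year for each country and industry. The sketch below assumes 2000 and 2018 as the endpoints and uses made-up values for placeholder countries P and Q; it is one possible way to measure growth, not a calculation from the original notebook.

```python
import pandas as pd

# Toy stand-in for the filtered time series (made-up values)
toy = pd.DataFrame({
    "Year": [2000, 2018, 2000, 2018],
    "Country": ["P", "P", "Q", "Q"],
    "Ind Code": ["J", "J", "J", "J"],
    "Value (USD)": [100.0, 400.0, 200.0, 180.0],
})
start, end = 2000, 2018  # assumed endpoints for the comparison

# One row per (industry, country), one column per year
wide = toy.pivot_table(index=["Ind Code", "Country"], columns="Year",
                       values="Value (USD)", aggfunc="sum")

# Total relative change between the two endpoint years
growth = (wide[end] / wide[start] - 1).rename("growth")

# Compound annual growth rate over the same period
cagr = ((wide[end] / wide[start]) ** (1 / (end - start)) - 1).rename("cagr")

print(pd.concat([growth, cagr], axis=1))
```

A positive `growth` value means the industry's value rose over the period in that country, and a negative value means it declined.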

---