# **Initial EDA**

## Objectives

* Perform basic exploratory data analysis (EDA) on the cleaned, feature-engineered datasets

## Inputs

* df_final.csv, industry_total.csv and industry_reference.csv

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\Capstone\\Capstone\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sayed\\OneDrive\\Documents\\Code institute\\Capstone\\Capstone'
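One caveat with the cells above is that re-running them moves the working directory up one more level each time. A minimal sketch of a guard against that is shown below, assuming the notebooks live in a folder called `jupyter_notebooks` (the name taken from the path printed earlier). The helper name `ensure_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path

# Folder name taken from the path printed above; adjust if the notebooks live elsewhere
NOTEBOOK_DIR_NAME = "jupyter_notebooks"


def ensure_project_root(notebook_dir_name=NOTEBOOK_DIR_NAME):
    """Move to the parent folder only if we are currently inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the (possibly unchanged) working directory
    return Path.cwd()


# Safe to run more than once: the second call leaves the directory untouched
print(ensure_project_root())
```

Because the function checks the folder name before calling `os.chdir()`, running the cell repeatedly should not climb above the project root.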

In [4]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [5]:
# Load the datasets
df_final = pd.read_csv(os.path.join(current_dir, 'df_final.csv'))
industry_total = pd.read_csv(os.path.join(current_dir, 'industry_total.csv'))
industry_ref = pd.read_csv(os.path.join(current_dir, 'industry_reference.csv'))

# Display the first few rows of each dataframe to confirm they are loaded correctly
print(df_final.head(2))
print('----------------------------------------------------------------------------------')
print(industry_total.head(2))
print('----------------------------------------------------------------------------------')
print(industry_ref.head(2))

   Year  Country ISO3 Continent Ind Code  \
0  2000  Austria  AUT    Europe        A   
1  2000  Austria  AUT    Europe        B   

                                            Industry  Value (LCU)  \
0  Manufacture of food products, beverages and to...      3931.90   
1  Manufacture of textiles, wearing apparel, leat...      1605.22   

   Value (USD)  Exchange Rate  Volatility  
0      3631.55       0.923612    0.117874  
1      1482.60       0.923612    0.117874  
----------------------------------------------------------------------------------
  Ind Code                                           Industry           LCU  \
0        A  Manufacture of food products, beverages and to...  1.155175e+10   
1        B  Manufacture of textiles, wearing apparel, leat...  6.641122e+09   

           USD  
0  40996209.07  
1  11975995.30  
----------------------------------------------------------------------------------
  Ind Code                                           Industry
0        A

---

# EDA

Let's start by looking at the distribution of industries globally. We'll use the data from 2018, as this is the latest year for which most of the main countries are available.
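The choice of year could be checked by counting how many distinct countries report data in each year. The sketch below assumes the `Year` and `Country` columns shown in the `df_final` preview. The helper name `countries_per_year` is hypothetical, and the toy rows are made up purely to show the shape of the result.

```python
import pandas as pd


def countries_per_year(df):
    """Return the number of distinct countries reported in each year."""
    return df.groupby("Year")["Country"].nunique().sort_index()


# Toy frame with made-up rows, for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Year": [2017, 2017, 2018, 2018, 2018, 2019],
    "Country": ["Austria", "France", "Austria", "France", "United States", "Austria"],
})

coverage = countries_per_year(toy)
print(coverage)
```

On the real data, `countries_per_year(df_final).tail()` would show how coverage changes over the most recent years.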

In [13]:
# Filter the data for the year 2018
df_final_2018 = df_final[df_final['Year'] == 2018]

# Create the sunburst chart
fig = px.sunburst(data_frame=df_final_2018, path=['Continent', 'Country'], values='Value (USD)',
                  color='Value (USD)', hover_name='ISO3')
fig.show()

As predicted, Asia is the largest contributor globally across the industries as a whole, although Europe and America are not far behind. Individually, the USA still produces the most value from its industries, closely followed by China. This data is from 2018, though, and given how much China's economy has grown since then, I believe it may have overtaken the US.
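The ranking read off the chart can also be confirmed numerically by summing `Value (USD)` per continent or per country for a given year. This is a minimal sketch that assumes the column names shown in the `df_final` preview. The helper `totals_by` is hypothetical, and the toy values and placeholder names are made up for illustration.

```python
import pandas as pd


def totals_by(df, year, level):
    """Sum 'Value (USD)' per `level` (e.g. 'Continent' or 'Country') for one year."""
    subset = df[df["Year"] == year]
    return (
        subset.groupby(level)["Value (USD)"]
        .sum()
        .sort_values(ascending=False)
    )


# Toy rows with made-up numbers and placeholder names, for illustration only
toy = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2018, 2017],
    "Continent": ["X", "X", "Y", "Y", "X"],
    "Country": ["P", "Q", "R", "R", "P"],
    "Value (USD)": [10.0, 5.0, 7.0, 4.0, 99.0],
})

continent_totals = totals_by(toy, 2018, "Continent")
country_totals = totals_by(toy, 2018, "Country")
print(continent_totals)
print(country_totals)
```

With the real data, `totals_by(df_final, 2018, "Country").head()` would list the largest producers in descending order.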

In [None]:
# Filter the data for the year 2018
df_final_2018 = df_final[df_final['Year'] == 2018]

# Aggregate data by industry and country
aggregated_data = df_final_2018.groupby(['Ind Code', 'ISO3'])['Value (USD)'].sum().reset_index()

# Create the bar chart
fig = px.bar(aggregated_data, x='Ind Code', y='Value (USD)', color='ISO3', title='Value (USD) for Each Industry in 2018 by Country')


# Show the plot
fig.show()

In [10]:
industry_ref

Unnamed: 0,Ind Code,Industry
0,A,"Manufacture of food products, beverages and to..."
1,B,"Manufacture of textiles, wearing apparel, leat..."
2,C,Manufacture of wood and of products of wood an...
3,D,Manufacture of paper products and printing
4,E,Manufacture of coke and refined petroleum prod...
5,F,Manufacturing of chemicals and pharmaceutical ...
6,G,Manufacture of rubber and plastics products
7,H,Manufacture of other non-metallic mineral prod...
8,I,Manufacture of basic metals
9,J,"Manufacture of fabricated metal products, mach..."


---

The user is interested in investing in the top 10 countries, so let's filter our dataset accordingly.

In [14]:
# Aggregate the data by country to get the total USD value for each country
country_aggregated = df_final_2018.groupby('Country')['Value (USD)'].sum().reset_index()

# Sort the countries by the total USD value in descending order and select the top 10
top_10_countries = country_aggregated.sort_values(by='Value (USD)', ascending=False).head(10)['Country']

# Filter df_final_2018 to include only the top 10 countries
df_final_2018_top_10 = df_final_2018[df_final_2018['Country'].isin(top_10_countries)]

# Display the filtered dataframe
df_final_2018_top_10

Unnamed: 0,Year,Country,ISO3,Continent,Ind Code,Industry,Value (LCU),Value (USD),Exchange Rate,Volatility
6900,2018,France,FRA,Europe,A,"Manufacture of food products, beverages and to...",44685.00,52770.93,1.180954,1.193091e-01
6901,2018,France,FRA,Europe,B,"Manufacture of textiles, wearing apparel, leat...",5556.00,6561.38,1.180954,1.193091e-01
6902,2018,France,FRA,Europe,C,Manufacture of wood and of products of wood an...,3057.00,3610.18,1.180955,1.193091e-01
6903,2018,France,FRA,Europe,D,Manufacture of paper products and printing,9736.62,11498.50,1.180954,1.193091e-01
6904,2018,France,FRA,Europe,E,Manufacture of coke and refined petroleum prod...,1986.00,2345.37,1.180952,1.193091e-01
...,...,...,...,...,...,...,...,...,...,...
21121,2018,United States,USA,North America,H,Manufacture of other non-metallic mineral prod...,62197.83,62197.83,1.000000,6.624372e-08
21122,2018,United States,USA,North America,I,Manufacture of basic metals,69110.83,69110.83,1.000000,6.624372e-08
21123,2018,United States,USA,North America,J,"Manufacture of fabricated metal products, mach...",704486.06,704486.06,1.000000,6.624372e-08
21124,2018,United States,USA,North America,K,Manufacture of transport equipment,306767.34,306767.34,1.000000,6.624372e-08


---


NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should be run top-down (avoid any workflow where, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content)

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder is created
except Exception as e:
  print(e)
