<a href="https://colab.research.google.com/github/Colsai/DATA-690-WANG/blob/master/hw13/JET_program_plotly_prac.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Short Look at The JET Program Participant Countries
## By Participants and Countries of Origin
## Focusing on Plotly Express Visualizations
![JET Program](https://www.nz.emb-japan.go.jp/images/Jet_logo_small.jpg)

## Introduction: The JET Program 
The JET (Japan Exchange and Teaching) Program is an international exchange program that places participants interested in Japan in Japanese primary and secondary schools, most often as English language assistants.



## Notebook Focus:
This homework is practice with a few visualization types that are new to me.
I would like to create:
- A pie chart
- A world heatmap
- Some other visualizations, if possible.

## 1. Import Packages for this Analysis
We will use only pandas and plotly.express, since this is a mini-notebook focused on a few specific visualizations.

In [147]:
#Import Packages for this short analysis
import pandas as pd
import plotly.express as px

In [148]:
#Upgrade Plotly so that newer chart types (such as sunburst) are available; restart the runtime if the version changes
!pip install --upgrade plotly

Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.12.0)


## 2. Scrape the Data from the Website:
There are some statistics about participants by country on this page:
http://jetprogramme.org/en/countries/

In [149]:
#Run the Web Scrape on the website
site = 'http://jetprogramme.org/en/countries/'
df = pd.read_html(site)

In [150]:
#pd.read_html returns a list of dataframes; use the first table on the page
df_jet = df[0]

## 3. Let's look at the head/tail/sample of the data for this data set.


In [151]:
#Let's look at the first 5 countries
df_jet.head(5)

Unnamed: 0,Country,ALT,CIR,SEA,Total
0,United States,2958,145,2,3105
1,United Kingdom,528,32,0,560
2,Australia,321,22,0,343
3,New Zealand,236,12,3,251
4,Canada,531,26,0,557


In [152]:
#Let's look at the last 20 rows
df_jet.tail(20)

Unnamed: 0,Country,ALT,CIR,SEA,Total
44,Kingdom of Tonga,1,0,0,1
45,Vietnam,0,7,0,7
46,Saint Vincent and the Grenadines,2,0,0,2
47,Uzbekistan,0,2,0,2
48,Seychelles,1,0,0,1
49,Croatia,0,1,0,1
50,United Republic of Tanzania,0,0,1,1
51,Republic of Malta,2,0,0,2
52,Republic of Estonia,4,0,0,4
53,Republic of Lithuania,0,2,0,2


In [153]:
#Let's sample 15 random rows here (note that some are year rows, not countries)
df_jet.sample(15)

Unnamed: 0,Country,ALT,CIR,SEA,Total
62,4th Year,532,53,1,586
22,Indonesia,0,5,1,6
19,Finland,1,2,0,3
51,Republic of Malta,2,0,0,2
63,5th Year,348,19,0,367
20,Mongolia,0,6,0,6
26,Netherlands,3,4,0,7
12,Peru,0,1,0,1
11,Brazil,0,17,0,17
32,Singapore,63,14,0,77


In [154]:
#Shape of the Data
df_jet.shape

(64, 5)

In [155]:
#Describe the data
df_jet.describe()

Unnamed: 0,Country,ALT,CIR,SEA,Total
count,64,64,64,64,64
unique,64,24,27,7,31
top,Republic of Bulgaria,0,0,0,1
freq,1,27,15,49,16


## 4. Fix some of the data
One issue visible in the tail and the random sample is that the table also contains rows for the 1st through 5th years of participation. These are a different kind of data from the per-country rows that make up the first 57 rows. Between the two sets sit two more rows: a grand total and a repeated header for the by-year table. I'll segment this into three dataframes.

In [156]:
#We will keep the df_jet dataframe, but create 3 different dataframes out of it.

#Break the dataframe into three distinct sets of data
df_countries = df_jet[:-7] #Everything except the last seven rows: the per-country table

df_totals = df_jet[-7:-5] #The grand total row and the repeated header row

df_years = df_jet[-5:] #The last five rows: totals by year of participation
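
The negative offsets above depend on the page always ending in exactly seven non-country rows. A more defensive split can key off the marker rows instead. The sketch below is only an illustration: the tiny `demo` frame and the helper name `split_jet_table` are made up for this example, and it assumes the grand total row starts with "Total Participants" and the repeated header row starts with "Totals By Years", as in the scraped table.

```python
import pandas as pd

# Hypothetical miniature of the scraped table (values copied from this notebook's outputs)
demo = pd.DataFrame({
    "Country": ["United States", "Canada",
                "Total Participants from All Countries",
                "Totals By Years of Programme",
                "1st Year", "2nd Year"],
    "ALT": ["2958", "531", "5234", "ALT", "1885", "1602"],
})

def split_jet_table(df):
    """Split on the marker rows rather than on fixed negative offsets."""
    # Position of the first row matching each marker.
    # Caveat: argmax returns 0 when nothing matches, so real code should check first.
    total_pos = df["Country"].str.startswith("Total Participants").to_numpy().argmax()
    header_pos = df["Country"].str.startswith("Totals By Years").to_numpy().argmax()

    countries = df.iloc[:total_pos].copy()              # rows before the grand total
    totals = df.iloc[total_pos:header_pos + 1].copy()   # grand total + repeated header
    years = df.iloc[header_pos + 1:].copy()             # per-year rows after the header
    return countries, totals, years

countries, totals, years = split_jet_table(demo)
print(countries["Country"].tolist())
print(years["Country"].tolist())
```

Taking `.copy()` of each piece also avoids pandas warnings if the pieces are modified later.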

## 5. Let's look at the new dataframes:
Did the split put the right rows in each dataframe?

In [157]:
#Let's look at the tail of the data for the new dataset. Does it catch any incorrect values?
df_countries.tail(5)

Unnamed: 0,Country,ALT,CIR,SEA,Total
52,Republic of Estonia,4,0,0,4
53,Republic of Lithuania,0,2,0,2
54,Federal Democratic Republic of Ethiopia,0,0,1,1
55,Republic of the Union of Myanmar,0,1,0,1
56,Republic of Chile,0,1,0,1


In [158]:
df_totals.head(5) #The grand total row, followed by the repeated header row for the by-year table

Unnamed: 0,Country,ALT,CIR,SEA,Total
57,Total Participants from All Countries,5234,514,13,5761
58,Totals By Years of Programme,ALT,CIR,SEA,Total
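
Row 58 repeats the column names as data, which is most likely why `describe()` earlier reported `unique`, `top`, and `freq` rather than means: the count columns hold strings. A minimal sketch of converting them follows. The `demo` frame is a hypothetical stand-in for `df_countries`, with values copied from the head of the table.

```python
import pandas as pd

# Hypothetical stand-in for df_countries: counts arrive as strings
demo = pd.DataFrame({
    "Country": ["United States", "United Kingdom", "Australia"],
    "ALT": ["2958", "528", "321"],
    "CIR": ["145", "32", "22"],
    "SEA": ["2", "0", "0"],
    "Total": ["3105", "560", "343"],
})

count_cols = ["ALT", "CIR", "SEA", "Total"]

# errors="coerce" turns anything non-numeric (such as a stray header cell) into NaN
demo[count_cols] = demo[count_cols].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
print(demo["Total"].sum())
```

Once the columns are numeric, sorting, summing, and plotting treat them as quantities rather than labels.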


## 6. Visualizations with Plotly.Express
Let's start with a bar chart of participants by year in the program.

In [159]:
#Create a very basic bar chart of ALT participants by year of participation
fig = px.bar(df_years, 
             x='Country', 
             y='ALT',
             title = "JET Participants By Year (1st to 5th Years)",
             hover_data=['Total','CIR','SEA'])

fig.update_layout(title_font_size=20)

fig.show()

In [160]:
#Sort by year and reset the index, dropping the original index values
df_years = df_years.sort_values(by="Country", ignore_index=True)

In [161]:
#Let's just check to make sure that the dataframe looks OK
df_years.head()

Unnamed: 0,Country,ALT,CIR,SEA,Total
0,1st Year,1885,203,3,2091
1,2nd Year,1602,138,6,1746
2,3rd Year,867,101,3,971
3,4th Year,532,53,1,586
4,5th Year,348,19,0,367


## 7. So what do these job positions mean?

| Position      | Description |
| ----------- | ----------- |
| ALT      | Assistant Language Teacher (they work in a school)       |
| CIR   | Coordinator for International Relations (they work as translators)        |
| SEA | Sports Exchange Advisors (they work in schools and with sports programs)|

Let's try some simple visualizations to see how many participants there are in each group, by year

In [163]:
#Let's just take the first four columns. 'Total' doesn't really serve any purpose here.
df_years = df_years[['Country','ALT','CIR','SEA']]

## 8. Line Graph of Participants over years
This line graph shows how many participants are in each year of the program, which hints at how many leave each year.

In [164]:
#Let's make a line graph of participants, per year
fig = px.line(df_years,
              x='Country',
              y=["ALT", "CIR", "SEA"],
              title="Graph of JET Participants, by Year (1st to 5th)",
              width=600,
              height=400
             )

#Show the figure
fig.show()

This is kind of helpful, but the SEA and CIR lines don't really tell us much, because their counts are tiny next to ALT.

Let's see how much 'loss' we might expect per year by computing the percent change from each year to the next.
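
Written out, the year-over-year percent change is as follows, where $n_k$ is the number of participants in their $k$-th year (this notation is introduced here only for illustration):

$$
\Delta_k = \frac{n_k - n_{k-1}}{n_{k-1}} \times 100, \qquad k \ge 2
$$

For example, ALTs go from 1885 in the 1st year to 1602 in the 2nd year, so $\Delta_2 = (1602 - 1885) / 1885 \times 100 \approx -15.01$. The 1st year has no earlier year to compare against, so it is assigned 0.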

In [165]:
#Percent change function
def percent_change(input_list, rounded_val=2):
    """Return the percent change of each value relative to the one before it."""
    percent_list = []
    last_year = None #No previous value exists for the first entry

    for this_year in input_list:
        if last_year is None or last_year == 0:
            pct_increase = 0.0 #No baseline to compare against (or it would divide by zero)
        else:
            pct_increase = ((this_year - last_year) / last_year) * 100

        percent_list.append(round(pct_increase, rounded_val))
        last_year = this_year

    return percent_list

In [166]:
#Convert a column (Series) into a list of ints
def convert_list_int(series):
    return [int(i) for i in series.to_list()]

In [167]:
#Run it for all three columns (this could be refactored into a loop later)
alt_pct = percent_change(convert_list_int(df_years['ALT']))
cir_pct = percent_change(convert_list_int(df_years['CIR']))
sea_pct = percent_change(convert_list_int(df_years['SEA']))

In [168]:
#Insert it back into the df
df_years.insert(2, "ALT Percent", alt_pct)
df_years.insert(4, "CIR Percent", cir_pct)
df_years.insert(6, "SEA Percent", sea_pct)

In [169]:
#Check to see if they were inserted
df_years.head()

Unnamed: 0,Country,ALT,ALT Percent,CIR,CIR Percent,SEA,SEA Percent
0,1st Year,1885,0.0,203,0.0,3,0.0
1,2nd Year,1602,-15.01,138,-32.02,6,100.0
2,3rd Year,867,-45.88,101,-26.81,3,-50.0
3,4th Year,532,-38.64,53,-47.52,1,-66.67
4,5th Year,348,-34.59,19,-64.15,0,-100.0
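
As a cross-check, pandas' built-in `Series.pct_change` gives the same figures without a hand-written loop. This is a sketch under assumptions: the `demo` frame is rebuilt by hand from the table above with integer columns, and the first-year `NaN` is filled with 0 to match the convention used in this notebook.

```python
import pandas as pd

# Hand-built copy of the by-year counts shown above, already as integers
demo = pd.DataFrame({
    "Year": ["1st Year", "2nd Year", "3rd Year", "4th Year", "5th Year"],
    "ALT": [1885, 1602, 867, 532, 348],
    "CIR": [203, 138, 101, 53, 19],
})

for col in ["ALT", "CIR"]:
    # pct_change returns a fraction relative to the previous row; the first row is NaN
    demo[f"{col} Percent"] = demo[col].pct_change().mul(100).round(2).fillna(0.0)

print(demo)
```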


In [177]:
#Let's make a line graph of the percent change, per year
fig = px.line(df_years,
              x='Country',
              y=["ALT Percent", "CIR Percent", "SEA Percent"],
              title="Percent Change in JET Participants, by Year (1st to 5th)",
              width=600,
              height=400
             )

#Show the figure
fig.show()

Unfortunately, the SEA line swings wildly: with so few participants in that position, a change of one or two people produces a very large percent change.

Let's drop SEA and re-visualize. There are very few SEAs, so including them makes the graph hard to read.

In [182]:
#Let's make a bar chart of the percent change for ALT and CIR only
fig = px.bar(df_years,
             x='Country',
             y=["ALT Percent", "CIR Percent"],
             title="JET Program: What year do most participants leave?",
             width=800,
             height=600
            )

#Show the figure
fig.show()

## 9. Analysis: ALTs and CIRs seem to leave at different times
In this visualization, the steepest decline for ALTs comes in their third year (-45.88% relative to the second year).

For CIRs, the steepest decline comes on entering the fifth and final year (-64.15% relative to the fourth year).

In [170]:
#Where are JET Participants From?
#Note: head(n) takes the first n rows as listed on the page, not the n largest
n = 10

fig = px.pie(df_countries.head(n),
             values='Total',
             names='Country',
             title=f'JET Participants: First {n} Listed Countries',
             width=900,
             height=900)

fig.show()

This wasn't that helpful, since it doesn't really show country names on the chart or give a good feel for the breakdown we are looking for.
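
One contributing factor may be that `head(n)` takes the first `n` rows in page order, and the page is not sorted by `Total` (Canada, with 557, is listed after Australia, with 343). A sketch of selecting the largest contributors instead is shown here. The `demo` frame is a hypothetical subset of the country table, and `top_n` is a name introduced for this example.

```python
import pandas as pd

# Hypothetical subset of the country table, in page order (not sorted by Total)
demo = pd.DataFrame({
    "Country": ["United States", "United Kingdom", "Australia", "New Zealand", "Canada"],
    "Total": ["3105", "560", "343", "251", "557"],
})

# Counts must be numeric before they can be ranked
demo["Total"] = pd.to_numeric(demo["Total"])

n = 3
top_n = demo.nlargest(n, "Total")  # the n rows with the largest Total, largest first

print(top_n["Country"].tolist())
```

A frame like `top_n` could then be passed to `px.pie` in place of `df_countries.head(n)`.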

Let's try a graphic that captures more of the detail in the data:

In [171]:
#Sunburst (Needs to have plotly upgraded to recent to work-- see top)
top_number = 25

fig = px.sunburst(df_countries[0:top_number], 
                  path=['SEA', 'CIR', 'ALT', 'Country'], 
                  values='Total',
                  color='Total', 
                  hover_data=['Country'],
                  title=f"Graph of Top-{top_number} Participant Countries in the JET Program",
                  width=900, 
                  height=900)

fig.update_layout(
    font_family="Arial",
    font_size = 16,
    font_color="red",
    title_font_family="Pontano Sans",
    title_font_color="black",
    title_font_size = 30,
)

#Show the Fig
fig.show()

## 10. Conclusion: 
JET program participants in different positions leave at different times: CIRs show their steepest drop on entering the fifth year, whereas ALTs show their steepest drop in the third year.

The United States is by far the biggest contributor of ALTs and CIRs. It also sends SEAs, although New Zealand sends more (3 versus 2).