# Technical Requirements

## The technical requirements for Project 1 are as follows:
 * Use Pandas to clean and format all dataset(s).
 * Create a Jupyter Notebook describing the data exploration and cleanup process.
 * Create a Jupyter Notebook illustrating the final data analysis.
 * Use PyViz, GeoViews, and Hvplot to create six to eight data visualizations (ideally, at least two per question asked of the data).
 * Save PNG images of the visualizations to share with the class and instructional team, and to include in the presentation and your repo's README.md file.
 * Use one new Python library that hasn't been covered in class.
 
 * Optional: Use at least one API, if an API can be found with data pertinent to your primary research questions.

 * Create a README.md in your repo with a write-up summarizing your major findings. This should include a heading for each question that was asked of your data, with a short description of what you found and any relevant plots under each heading.

# Key Questions:
### How has the global internet user population evolved from its inception to the present day?
### Which countries have experienced the highest growth in internet users during different time periods?
### Is there a correlation between cellular subscription rates and internet user rates?
### How does broadband subscription availability influence internet user adoption and growth?
### Are there variations in broadband subscription rates among different countries and years?
### How do broadband subscription rates relate to the number of internet users in each country?
### What impact does the availability of broadband have on internet penetration rates?
### Can the growth of internet users and broadband subscriptions be linked to a country's economic development?

# Internet growth in the world over the years.

In [2]:
# Initial imports
import pandas as pd
import numpy as np
from pathlib import Path
import hvplot.pandas

In [3]:
# Read the Global_Internet_users.csv to a dataframe
global_Internet_users_df = pd.read_csv(Path("Resources/Global_Internet_users.csv"), index_col=0).dropna()

# Clean data: rename Entity to Country, drop the Code column, keep 1990 onward
global_Internet_users_df = global_Internet_users_df.rename(columns={"Entity": "Country"}).drop(columns="Code").loc[global_Internet_users_df['Year'] > 1989]

# Review the DataFrame
display(global_Internet_users_df.head(5))
display(global_Internet_users_df.tail(5))

Unnamed: 0,Country,Year,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
10,Afghanistan,1990,0.0,0.0,0,0.0
11,Afghanistan,1991,0.0,0.0,0,0.0
12,Afghanistan,1992,0.0,0.0,0,0.0
13,Afghanistan,1993,0.0,0.0,0,0.0
14,Afghanistan,1994,0.0,0.0,0,0.0


Unnamed: 0,Country,Year,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
8862,Zimbabwe,2016,91.793457,23.119989,3341464,1.217633
8863,Zimbabwe,2017,98.985077,24.4,3599269,1.315694
8864,Zimbabwe,2018,89.404869,25.0,3763048,1.406322
8865,Zimbabwe,2019,90.102287,25.1,3854006,1.395818
8866,Zimbabwe,2020,88.755806,29.299999,4591211,1.368916
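
The cleaning cell above chains several steps on one line. Below is a minimal sketch of the same steps as a small reusable function. It assumes a tiny in-memory CSV (`SAMPLE_CSV`, illustrative rows only) in place of `Resources/Global_Internet_users.csv`. The function name `clean_internet_users` and its `first_year` parameter are hypothetical and not part of the notebook.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for Resources/Global_Internet_users.csv.
# The rows are illustrative only; the real file has many more.
SAMPLE_CSV = """,Entity,Code,Year,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
0,Afghanistan,AFG,1989,0.0,0.0,0,0.0
1,Afghanistan,AFG,1990,0.0,0.0,0,0.0
2,Zimbabwe,ZWE,2020,88.755806,29.299999,4591211,1.368916
3,Zimbabwe,ZWE,2021,,,,
"""


def clean_internet_users(raw: pd.DataFrame, first_year: int = 1990) -> pd.DataFrame:
    """Drop incomplete rows, rename Entity to Country, drop Code, keep first_year onward."""
    return (
        raw.dropna()
        .rename(columns={"Entity": "Country"})
        .drop(columns="Code")
        # Filter on the already-cleaned frame instead of the original variable.
        .loc[lambda df: df["Year"] >= first_year]
    )


raw_df = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=0)
clean_df = clean_internet_users(raw_df)
print(clean_df)
```

Filtering with a `lambda` inside `.loc` keeps every step operating on the intermediate result, so the chain does not depend on index alignment with the original DataFrame.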


## Worldwide internet usage over the years

In [6]:
# Worldwide internet usage
world_internet = global_Internet_users_df[global_Internet_users_df["Country"] == "World"].groupby(["Year", "Country"]).mean()
world_internet.head(10)

Year,Country,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
1990,World,0.21093,0.049235,2617438.0,0.0
1991,World,0.301528,0.079181,4280727.0,0.0
1992,World,0.426715,0.125364,6885825.0,0.0
1993,World,0.617955,0.1789,9978025.0,0.0
1994,World,0.989809,0.3599,20372971.0,0.0
1995,World,1.584408,0.681457,39137572.0,0.0
1996,World,2.497344,1.32347,77094037.0,0.0
1997,World,3.652477,2.039509,120463190.0,0.0
1998,World,5.33032,3.136406,187786430.0,0.0
1999,World,8.119304,4.62949,280906271.0,0.0


In [7]:
# Worldwide internet usage
world_internet.hvplot(
    x='Year', 
    y='No. of Internet Users',
    yformatter='%.0f',
    xlabel='Year',
    ylabel='No. of Internet Users',     
    title = 'Internet users in the world over the years'
)
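
The line chart shows the overall trend. As a rough complement, the sketch below computes the year-over-year growth rate from the world totals printed above (1990 to 1999 only). The variable names `world_users` and `yoy_growth_pct` are assumptions made for this example.

```python
import pandas as pd

# World totals copied from the world_internet.head(10) output above.
world_users = pd.Series(
    [2617438, 4280727, 6885825, 9978025, 20372971,
     39137572, 77094037, 120463190, 187786430, 280906271],
    index=pd.Index(range(1990, 2000), name="Year"),
    name="No. of Internet Users",
)

# Year-over-year growth in percent; 1990 has no previous year, so it is NaN.
yoy_growth_pct = world_users.pct_change() * 100
print(yoy_growth_pct.round(1))
```

On the full DataFrame the same idea would be `world_internet["No. of Internet Users"].pct_change()`, assuming the rows are sorted by year.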

## Internet adoption in low- and high-income countries over the years

In [25]:
# Low-income countries: internet adoption
world_internet_low_income = global_Internet_users_df[global_Internet_users_df["Country"] == "Low income"].groupby(["Year", "Country"]).mean()
world_internet_low_income.tail(10)

Year,Country,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
2011,Low income,34.760529,3.795007,0.0,0.106787
2012,Low income,39.766285,4.610525,0.0,0.153772
2013,Low income,46.241608,5.386801,0.0,0.212381
2014,Low income,52.142506,7.023655,0.0,0.311067
2015,Low income,55.225636,9.158152,0.0,0.367887
2016,Low income,53.422337,11.100526,0.0,0.402788
2017,Low income,53.422897,13.800062,0.0,0.410638
2018,Low income,60.467735,16.288969,0.0,0.0
2019,Low income,55.59174,18.175982,0.0,0.418275
2020,Low income,57.866852,20.615334,0.0,0.471478


In [26]:
# Low-income countries: plot internet adoption
world_internet_low_income = world_internet_low_income.hvplot(
    x='Year', 
    y='Internet Users(%)',
    yformatter='%.0f',
    xlabel='Year',
    ylabel='Internet Users %',     
    title = 'Internet adoption in low-income countries over the years'
)

world_internet_low_income

In [22]:
# High-income countries: internet adoption
world_internet_high_income = global_Internet_users_df[global_Internet_users_df["Country"] == "High income"].groupby(["Year", "Country"]).mean()
world_internet_high_income.tail(10)

Year,Country,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
2011,High income,112.956345,72.799301,0.0,27.540102
2012,High income,115.416412,75.499733,0.0,28.475428
2013,High income,117.210274,76.85537,0.0,29.948442
2014,High income,121.379593,78.494446,0.0,30.598824
2015,High income,123.212174,79.930954,0.0,31.338318
2016,High income,124.753593,84.273598,0.0,32.238873
2017,High income,124.861229,85.774185,0.0,33.173477
2018,High income,121.195236,87.48468,0.0,33.871544
2019,High income,122.530327,89.041252,0.0,34.651367
2020,High income,122.305847,89.821617,0.0,35.969559


In [23]:
# High-income countries: plot internet adoption
world_internet_high_income = world_internet_high_income.hvplot(
    x='Year', 
    y='Internet Users(%)',
    yformatter='%.0f',
    xlabel='Year',
    ylabel='Internet Users %',     
    title = 'Internet adoption in high-income countries over the years'
)
world_internet_high_income 

In [27]:
# Compare low- and high-income countries
(world_internet_high_income * world_internet_low_income).opts(title = 'Internet adoption in low- and high-income countries')
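
The overlay compares the two curves visually. A small sketch of how the gap could also be expressed in percentage points is shown below, using `Internet Users(%)` values copied from the two tables above for three selected years. The names `adoption`, `wide`, and the `Gap (pct. points)` column are assumptions made for this example.

```python
import pandas as pd

# Internet Users(%) copied from the low-income and high-income tables above.
adoption = pd.DataFrame(
    {
        "Year": [2011, 2015, 2020, 2011, 2015, 2020],
        "Country": ["Low income"] * 3 + ["High income"] * 3,
        "Internet Users(%)": [3.795007, 9.158152, 20.615334,
                              72.799301, 79.930954, 89.821617],
    }
)

# One row per year, one column per income group.
wide = adoption.pivot(index="Year", columns="Country", values="Internet Users(%)")

# Difference between the two groups in percentage points.
wide["Gap (pct. points)"] = wide["High income"] - wide["Low income"]
print(wide)
```

For these three years the gap stays at roughly 69 to 71 percentage points, even though both groups grow.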

## Worldwide cellular and broadband subscriptions over the years

In [12]:
# Worldwide Cellular Subscription%
World_cellular_line = world_internet.hvplot(
    x='Year', 
    y='Cellular Subscription',
    yformatter='%.0f',
    xlabel='Year',
    ylabel='Cellular Subscription %',     
    title = 'Cellular Subscription % in the world over the years'
)
World_cellular_line

In [13]:
# Worldwide Broadband Subscription%
world_broadband_line = world_internet.hvplot(
    x='Year', 
    y='Broadband Subscription',
    yformatter='%.0f',
    xlabel='Year',
    ylabel='Broadband Subscription %',     
    title = 'Broadband Subscription % in the world over the years'
)
world_broadband_line

In [14]:
# Overlay both charts
(world_broadband_line * World_cellular_line).opts(title = 'Broadband Subscription vs. Cellular Subscription in the world over the years')
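
One of the key questions asks whether cellular subscription rates and internet user rates are correlated. The notebook answers this visually. A minimal sketch of a numeric check with pandas is shown below, using only the world rows for 1990 to 1999 printed earlier. The names `world_sample` and `pearson_r` are assumptions made for this example.

```python
import pandas as pd

# World values copied from the world_internet.head(10) output earlier in the notebook.
world_sample = pd.DataFrame(
    {
        "Cellular Subscription": [0.21093, 0.301528, 0.426715, 0.617955, 0.989809,
                                  1.584408, 2.497344, 3.652477, 5.33032, 8.119304],
        "Internet Users(%)": [0.049235, 0.079181, 0.125364, 0.1789, 0.3599,
                              0.681457, 1.32347, 2.039509, 3.136406, 4.62949],
    },
    index=pd.Index(range(1990, 2000), name="Year"),
)

# Pearson correlation between the two columns (the pandas default method).
pearson_r = world_sample["Cellular Subscription"].corr(world_sample["Internet Users(%)"])
print(f"Pearson r, 1990-1999: {pearson_r:.3f}")
```

Both series rise steadily over time, so a high coefficient here mostly reflects the shared trend and should not be read as evidence of causation.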

## Percentage of internet users in each country by year

In [5]:
# % of internet users by country
percent_internet_users_line = global_Internet_users_df.hvplot.line(
    x= 'Year',  
    y= 'Internet Users(%)',
    yformatter='%.0f',
    xlabel='Year',
    ylabel='Internet Users(%)',     
    title = '% of internet users by country over the years',
    groupby='Country'
)
percent_internet_users_line

In [6]:
# Internet Users(%) over the years by Country
global_Internet_users_df.hvplot.bar(
    x='Country', 
    y='Internet Users(%)',
    yformatter='%.0f',
    xlabel='Country',
    ylabel='Internet Users(%)',     
    title = 'Internet Users(%) over the years by Country',
    groupby='Year',
    rot=90,
    frame_width=2000,
    frame_height=500,
    
)
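
The grouped charts show one country or one year at a time. To rank entries by how much their adoption changed between two years, a sketch like the one below could be used. It takes `Internet Users(%)` values for 2016 and 2020 from the tables printed earlier (Zimbabwe and the two income-group aggregates). The helper name `adoption_change` is hypothetical.

```python
import pandas as pd

# (Country, Year, Internet Users(%)) copied from earlier outputs in the notebook.
rows = [
    ("Zimbabwe", 2016, 23.119989), ("Zimbabwe", 2020, 29.299999),
    ("Low income", 2016, 11.100526), ("Low income", 2020, 20.615334),
    ("High income", 2016, 84.273598), ("High income", 2020, 89.821617),
]
sample = pd.DataFrame(rows, columns=["Country", "Year", "Internet Users(%)"])


def adoption_change(df: pd.DataFrame, start_year: int, end_year: int) -> pd.Series:
    """Change in Internet Users(%) between two years, largest increase first."""
    wide = df.pivot(index="Country", columns="Year", values="Internet Users(%)")
    return (wide[end_year] - wide[start_year]).sort_values(ascending=False)


change_2016_2020 = adoption_change(sample, 2016, 2020)
print(change_2016_2020)
```

The result is in percentage points, so entries that start from a low base can rank above entries with higher absolute adoption.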

In [7]:
# internet usage Dataframe grouped by year and Country
# global_Internet_year_country = global_Internet_users_df.drop(columns=["Cellular Subscription", "No. of Internet Users", "Broadband Subscription"])
# global_Internet_year_country = global_Internet_year_country[~global_Internet_year_country["Country"].isin(["World", 'North America'])].groupby(["Year", "Country"]).mean()

# global_Internet_year_country.hvplot.barh('Year', 'Internet Users(%)', by='Country', stacked=True, width=1500, height=600)

## The countries that lead the use of the internet over the years.

In [7]:
# The countries that have led the way in internet usage over the years.

# Drop the aggregate rows for World and North America
global_Internet_df = global_Internet_users_df[~global_Internet_users_df["Country"].isin(["World", 'North America'])]

# Get the highest No. of Internet Users in each year
# (select the column first; max() does not take a column name as an argument)
max_global_Internet_df = global_Internet_df.groupby("Year")[["No. of Internet Users"]].max()

# Filter the countries with the highest No. of Internet Users in each year
global_lead_internet_user = global_Internet_df[global_Internet_df['No. of Internet Users'].isin(max_global_Internet_df['No. of Internet Users'])]
global_lead_internet_user = global_lead_internet_user.sort_values(by='Year', ascending=True).set_index('Year')

# display(global_Internet_users_df.loc[global_Internet_users_df['Year'] == 1991].sort_values(by='No. of Internet Users', ascending=False).head(10))
global_lead_internet_user.head()

Year,Country,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription
1990,United States,2.09545,0.784729,1946784,0.0
1991,United States,2.968951,1.163194,2926132,0.0
1992,United States,4.293057,1.724203,4399739,0.0
1993,United States,6.168585,2.271673,5878630,0.0
1994,United States,9.203138,4.862781,12753789,0.0
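
The filter above matches rows by value with `isin`, so a row can also be selected when its user count happens to equal the maximum of a different year. A hedged alternative sketch with `idxmax`, which picks exactly one row per year, is shown below. The `toy` DataFrame is an assumption made for this example: the United States values come from the table above, and the Japan values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Country": ["United States", "Japan", "United States", "Japan"],
        "Year": [1990, 1990, 1991, 1991],
        # United States values are from the table above; Japan values are invented.
        "No. of Internet Users": [1946784, 25000, 2926132, 60000],
    }
)

# Index label of the row with the most users within each year.
leader_rows = toy.groupby("Year")["No. of Internet Users"].idxmax()

# Look those rows up and index the result by year.
leaders = toy.loc[leader_rows].set_index("Year")
print(leaders)
```

Note that `idxmax` returns only the first row when several countries tie, whereas the `isin` filter keeps all of them.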


In [8]:
global_lead_internet_user_bar = global_lead_internet_user.hvplot.bar(
    x= 'Year',  
    y= 'No. of Internet Users',
    yformatter='%.0f',
    xlabel='Year',
    ylabel='No. of Internet Users',     
    hover_cols='Country',
    title = 'Countries that lead the use of the internet over the years',
    rot=80,
    legend="top_left",
    # color='Country',
)
global_lead_internet_user_bar

### Adding Geo Location

In [9]:
# Geographic location (latitude and longitude) of all countries
country_geo_location = pd.read_csv(Path("Resources/countries.csv"),index_col=0).dropna()

# Clean data
country_geo_location = country_geo_location.drop(columns=["iso3", "iso2", "numeric_code", "phone_code", "capital", "currency", "currency_name", "currency_symbol", "tld", "native", "region", "subregion", "timezones", "emoji", "emojiU"])
country_geo_location = country_geo_location.rename(columns={"name": "Country"})

country_geo_location.head()

id,Country,latitude,longitude
1,Afghanistan,33.0,65.0
2,Aland Islands,60.116667,19.9
3,Albania,41.0,20.0
4,Algeria,28.0,3.0
5,American Samoa,-14.333333,-170.0


In [10]:
# Merge the global internet DataFrame with the geographic location of all countries
global_internet_geo = global_Internet_users_df.set_index(["Country"]).merge(country_geo_location.set_index(["Country"]), left_index=True, right_index=True)
global_internet_geo.reset_index().groupby(["Year", "Country"]).mean()


Year,Country,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription,latitude,longitude
1990,Afghanistan,0.000000,0.000000,0.0,0.000000,33.000000,65.000000
1990,Albania,0.000000,0.000000,0.0,0.000000,41.000000,20.000000
1990,Algeria,0.001825,0.000000,0.0,0.000000,28.000000,3.000000
1990,American Samoa,0.000000,0.000000,0.0,0.000000,-14.333333,-170.000000
1990,Andorra,0.000000,0.000000,0.0,0.000000,42.500000,1.500000
...,...,...,...,...,...,...,...
2020,Venezuela,58.179211,0.000000,0.0,9.008163,8.000000,-66.000000
2020,Vietnam,142.733368,70.300003,67944025.0,17.155838,16.166667,107.833333
2020,Yemen,50.888550,0.000000,0.0,1.310938,15.000000,48.000000
2020,Zambia,103.917831,19.799999,3747688.0,0.447765,-15.000000,30.000000
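
The merge above is an inner join on the country name, so any entry whose name does not appear in both files (aggregates such as `World`, or countries spelled differently) is dropped without warning. A minimal sketch of how unmatched names could be listed before plotting is shown below. The two small DataFrames are illustrative stand-ins, and the names `checked` and `unmatched` are assumptions made for this example.

```python
import pandas as pd

# Illustrative stand-ins for the internet-usage and geo-location DataFrames.
internet = pd.DataFrame(
    {"Country": ["Afghanistan", "World", "Low income"], "Year": [1990, 1990, 1990]}
)
geo = pd.DataFrame(
    {"Country": ["Afghanistan", "Albania"], "latitude": [33.0, 41.0], "longitude": [65.0, 20.0]}
)

# A left join with indicator=True adds a _merge column that shows
# whether each row found a partner in the geo table.
checked = internet.merge(geo, on="Country", how="left", indicator=True)

# Names present in the internet data but missing from the geo data.
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "Country"].unique())
print(unmatched)
```

Reviewing this list makes it easier to decide which names are aggregates that should be dropped and which are real countries that need a spelling fix.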


### Map with the No. of Internet Users for each country in each year

In [11]:
# Plot the map with the No. of Internet Users for each country in each year
global_internet_geo.hvplot.points(
    'longitude', 
    'latitude', 
    geo=True, 
    size='Internet Users(%)',
    color='No. of Internet Users',
    tiles='OSM',
    hover_cols='all',
    frame_width=700,
    frame_height=500,
    title='No. of Internet Users for each country in each year',
    groupby='Year'    
)

# Comparison of GDP and internet usage by country

In [12]:
# Read the CSV to a dataframe
global_gdp = pd.read_csv(Path("Resources/Global_gdp.csv"))

# Clean data
global_gdp = global_gdp.set_index('Year').rename(columns={"Value": "GDP_Value", "Country or Area": "Country"}).drop(columns="Item")

# Review the DataFrame
display(global_gdp.head())
display(global_gdp.tail())

Year,Country,GDP_Value
2021,Afghanistan,372.548875
2020,Afghanistan,516.866543
2019,Afghanistan,500.522664
2018,Afghanistan,502.056771
2017,Afghanistan,530.149831


Year,Country,GDP_Value
1974,Zimbabwe,836.465455
1973,Zimbabwe,718.359564
1972,Zimbabwe,603.03037
1971,Zimbabwe,503.731409
1970,Zimbabwe,449.078443


In [13]:
# Joining internet usage Dataframe with GDP DataFrame

# GDP DataFrame grouped by year and Country
gdp_year_country_df = global_gdp.groupby(["Year", "Country"]).mean()

# internet usage Dataframe grouped by year and Country
global_Internet_year_country = global_Internet_users_df[~global_Internet_users_df["Country"].isin(["World", 'North America'])].groupby(["Year", "Country"]).mean()

# Concat internet usage Dataframe with GDP DataFrame
global_Internet_gdp = pd.concat([global_Internet_year_country, gdp_year_country_df], axis=1, join="outer") 

# Drop nulls
global_Internet_gdp = global_Internet_gdp.dropna()

# Show dataFrame
global_Internet_gdp.head()

Year,Country,Cellular Subscription,Internet Users(%),No. of Internet Users,Broadband Subscription,GDP_Value
1990,Afghanistan,0.0,0.0,0.0,0.0,332.82652
1990,Albania,0.0,0.0,0.0,0.0,651.201284
1990,Algeria,0.001825,0.0,0.0,0.0,2419.907394
1990,Andorra,0.0,0.0,0.0,0.0,24305.217277
1990,Angola,0.0,0.0,0.0,0.0,1154.981017


In [15]:
# The GDP of each country by year
global_Internet_gdp_bar = global_Internet_gdp.hvplot.bar(
    x= 'Year',  
    y= 'GDP_Value',
    yformatter='%.0f',
    xlabel='Year',
    ylabel='GDP_Value',     
    title = 'The GDP of each country by year',
    groupby='Country',
    rot=90
)
global_Internet_gdp_bar

In [16]:
# Show the GDP chart side by side with the % of internet users chart
global_Internet_gdp_bar + percent_internet_users_line
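
The side-by-side charts allow a visual comparison for one country at a time. To put a number on the relationship between GDP and internet adoption, a sketch along the following lines could be applied to `global_Internet_gdp` for a single year. The `toy` DataFrame below uses made-up values for four hypothetical countries (A to D), and the `log10_GDP` column is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Made-up values for four hypothetical countries in a single year.
toy = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        "GDP_Value": [500.0, 2500.0, 12000.0, 45000.0],
        "Internet Users(%)": [12.0, 35.0, 68.0, 90.0],
    }
)

# GDP values span several orders of magnitude, so compare on a log scale as well.
toy["log10_GDP"] = np.log10(toy["GDP_Value"])

r_raw = toy["GDP_Value"].corr(toy["Internet Users(%)"])
r_log = toy["log10_GDP"].corr(toy["Internet Users(%)"])
print(f"Pearson r (raw GDP):   {r_raw:.3f}")
print(f"Pearson r (log10 GDP): {r_log:.3f}")
```

A correlation computed this way describes association only; it does not show that economic development causes higher internet adoption.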