In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

The code above imports the modules needed for the exercise.

In [None]:
gdp_df = pd.read_csv("../data/gdp_percapita.csv")  # forward slashes avoid invalid "\d" and "\g" escape sequences
gdp_df

The code above reads the .csv file into a DataFrame. In the path, '..' means moving up one level in the file structure before looking in the data folder for the file.
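
Backslashes inside an ordinary Python string start escape sequences, so a relative path is often safer when written with forward slashes or built with `pathlib`. A minimal sketch, assuming the same folder layout as the cell above (the variable name `data_path` is only for illustration):

```python
from pathlib import Path

# Build the relative path piece by piece; ".." is the parent directory.
data_path = Path("..") / "data" / "gdp_percapita.csv"

print(data_path.as_posix())  # ../data/gdp_percapita.csv
print(data_path.name)        # gdp_percapita.csv
```

A `Path` object like this can be passed straight to `pd.read_csv`.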

The code below returns the shape of the DataFrame: the first number is the number of rows and the second is the number of columns.

In [None]:
gdp_df.shape

#### How many rows and columns are in `gdp_df`? What are the data types of each column?

7176 rows and 4 columns.
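
The question also asks about data types, which `.shape` does not report. A minimal sketch of how the types could be inspected, using a small made-up frame as a stand-in for `gdp_df` (the column names and values here are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for gdp_df; the real frame is read from the CSV.
toy_df = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2020, 2019, 2020],
    "GDP_Per_capita": [1500.5, 1450.0, 32000.25],
    "Value Footnotes": [None, None, None],
})

n_rows, n_cols = toy_df.shape  # (rows, columns)
print(n_rows, n_cols)
print(toy_df.dtypes)           # one dtype per column
```

Running `gdp_df.dtypes` or `gdp_df.info()` on the real DataFrame would give the answer for this dataset.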

In [None]:
gdp_df = gdp_df.drop(columns = ["Value Footnotes"])

Above code is dropping the "Value Footnotes" column.

In [None]:
gdp_df.columns = ['Country', 'Year', 'GDP_Per_capita']

In [None]:
gdp_df

The code above renames the columns and displays the updated DataFrame.

In [None]:
years = set(gdp_df.Year)
print(years)
len(years)

Above code creates a set of all unique years in the DataFrame and the count of values in the set.

In [None]:
year_counts = gdp_df.Country.value_counts()

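The code above starts building a new DataFrame of year counts. Calling value_counts() on Country counts the rows for each country, which is the number of years that country appears in the data.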

In [None]:
year_counts = year_counts.to_frame()

In [None]:
year_counts = year_counts.reset_index()

In [None]:
year_counts.columns = ['country', 'year_count']

Above code finishes organizing and re-labeling new DataFrame 'year_counts'
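
The same result can be built in a single chain. This is a sketch on a small made-up frame (the countries "A" and "B" and their years are assumptions); `rename_axis` names the index before `reset_index` turns it into a column:

```python
import pandas as pd

# Hypothetical data: country A has all three years, country B is missing 2018.
toy_df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [2018, 2019, 2020, 2019, 2020],
})

year_counts = (
    toy_df["Country"]
    .value_counts()                  # rows per country
    .rename_axis("country")          # name the index
    .reset_index(name="year_count")  # index -> column, counts -> "year_count"
)

n_years = toy_df["Year"].nunique()
missing_years = year_counts[year_counts["year_count"] != n_years]
print(year_counts)
print(missing_years)
```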

In [None]:
missing_year_mask = year_counts['year_count'] != len(years)

Above code creates the mask to filter countries that aren't present in every year.

In [None]:
missing_years = year_counts[missing_year_mask]
print('Countries missing data for at least one year:', missing_years.country.count())

The code above filters the year_counts DataFrame to countries whose year_count is not equal to the number of years in the data (31) and prints how many such countries there are.

In [None]:
print("Countries with complete data:", year_counts.country.count() - missing_years.country.count())

The code block above subtracts the number of countries missing data from the total count of countries in the data set.

In [None]:
print("Years present in the data:", len(years))

Above code block counts the number of years present in a previously created set of all years.

In [None]:
missing_years

#### 7. How many countries have data for all years? Which countries are missing many years of data? Look at the number of observations per year. What do you notice? 

37 countries are missing data for at least one year; 205 countries have data for all 31 years in the data.

The majority of countries have entries for all years. It seems that the countries missing data were not a part of the UN during a portion of the dataset. 
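
The question also asks about the number of observations per year. A minimal sketch of one way to count them, on a made-up frame (the data is an assumption); on the real data the same `groupby` would run on `gdp_df`:

```python
import pandas as pd

# Hypothetical data: 2018 has one observation, 2019 and 2020 have two each.
toy_df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [2018, 2019, 2020, 2019, 2020],
})

obs_per_year = toy_df.groupby("Year").size()  # rows per year, sorted by year
print(obs_per_year)
print("Year with the fewest observations:", obs_per_year.idxmin())
```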

In [None]:
mask_2020 = gdp_df['Year'] == 2020

Creating a mask for information in the year 2020.

In [None]:
gdp2020_df = gdp_df[mask_2020]
gdp2020_df

In [None]:
type(gdp2020_df)

Masking the gdp_df DataFrame to display just the year 2020, checking data type.

In [None]:
sns.histplot(data=gdp2020_df, x="GDP_Per_capita")
plt.show()

Plotting a histogram for 2020 GDP_Per_capita.

In [None]:
sns.kdeplot(gdp2020_df['GDP_Per_capita'])
plt.show()

Plotting a density plot for 2020 GDP_Per_capita.

In [None]:
sns.boxplot( x=gdp2020_df["Year"], y=gdp2020_df["GDP_Per_capita"] )
plt.show()

Creating boxplot for 2020 GDP_Per_capita

In [None]:
sns.violinplot( x=gdp2020_df["Year"], y=gdp2020_df["GDP_Per_capita"] )
plt.show()

Creating a violin plot for 2020 GDP_Per_capita

#### 8. In this question, you're going to create some plots to show the distribution of GDP per capita for the year 2020. Go to the Python Graph Gallery (https://www.python-graph-gallery.com/) and look at the different types of plots under the Distribution section. Create a histogram, a density plot, a boxplot, and a violin plot. What do you notice when you look at these plots? How do the plots compare and what information can you get out of one type that you can't necessarily get out of the others?

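The boxplot and histogram do not show negative values, whereas the density and violin plots can extend below zero because kernel smoothing spreads density past the smallest observed value. The density plot shows the estimated density of the variable instead of the count or a visual representation of the count.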
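
Density and violin plots can extend below zero even when no value is negative, because they draw a kernel density estimate. A minimal sketch with a hand-written Gaussian KDE illustrates this; the data and the fixed bandwidth are assumptions, and seaborn chooses its own bandwidth:

```python
import numpy as np

# Hypothetical, strictly non-negative data with one large value.
values = np.array([0.5, 1.0, 2.0, 3.0, 40.0])
bandwidth = 2.0  # assumed fixed bandwidth, for illustration only


def gaussian_kde_at(x, data, h):
    """Gaussian kernel density estimate at a single point x."""
    z = (x - data) / h
    return np.exp(-0.5 * z**2).sum() / (len(data) * h * np.sqrt(2 * np.pi))


density_below_zero = gaussian_kde_at(-1.0, values, bandwidth)
print("Smallest value:", values.min())
print("Estimated density at x = -1 is positive:", density_below_zero > 0)
```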

In [None]:
print("The median GDP_Per_capita for 2020 is", gdp2020_df.GDP_Per_capita.median())

#### 9. What was the median GDP per capita value in 2020?

The median GDP per capita value in 2020 is 12908.9374056206.

In [None]:
gdp_decades = gdp_df[(gdp_df.Year == 2020) | (gdp_df.Year == 2010) | (gdp_df.Year == 2000) | (gdp_df.Year == 1990)]


The code above subsets the gdp_df DataFrame to a new frame called gdp_decades, using the | operator to signify the 'or' condition. Each condition is wrapped in parentheses because | binds more tightly than ==.

When filtering for multiple conditions:

| = or, & = and, ~ = not 
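
When every condition tests the same column against a list of values, `isin` is a shorter equivalent of chained `|` conditions. A minimal sketch on made-up data (the values are assumptions):

```python
import pandas as pd

# Hypothetical data with one row per year.
toy_df = pd.DataFrame({
    "Year": [1990, 1995, 2000, 2010, 2015, 2020],
    "GDP_Per_capita": [10.0, 12.0, 14.0, 18.0, 21.0, 20.0],
})

decades = [1990, 2000, 2010, 2020]

# Chained "or" conditions, each wrapped in parentheses.
with_or = toy_df[
    (toy_df.Year == 1990) | (toy_df.Year == 2000)
    | (toy_df.Year == 2010) | (toy_df.Year == 2020)
]

# The same filter with isin, and its negation with ~.
with_isin = toy_df[toy_df.Year.isin(decades)]
not_decades = toy_df[~toy_df.Year.isin(decades)]

print(with_or.equals(with_isin))
print(not_decades.Year.tolist())
```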

In [None]:
gdp_decades

In [None]:
sns.boxplot( x=gdp_decades["Year"], y=gdp_decades["GDP_Per_capita"] )
plt.show()

Boxplot of the gdp_decades DataFrame.

In [None]:
sns.barplot(x=gdp_decades["Year"], y=gdp_decades["GDP_Per_capita"])

Barplot of the gdp_decades DataFrame.

In [None]:
sns.regplot(x=gdp_decades["Year"], y=gdp_decades["GDP_Per_capita"], fit_reg=False)

Scatterplot without a trendline. regplot draws a linear regression fit by default, so fit_reg=False is passed to suppress it.

In [None]:
sns.regplot(x=gdp_decades["Year"], y=gdp_decades["GDP_Per_capita"])

Scatterplot with a linear regression fit plotted by default.

In [None]:
gdp100k = gdp_df[(gdp_df.GDP_Per_capita > 100000)]  # strictly greater than, as the question asks
first100k = gdp100k.sort_values(by=['Year']).head(1)

In [None]:
print("The first country to have 100k GDP_Per_capita in our dataset is:", first100k.Country.iloc[0])  # .iloc[0] prints the name rather than the whole Series

The above code first creates a DataFrame of all entries with a GDP_Per_capita greater than 100000,
then sorts by year ascending and limits to just the first result to get our answer. 
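
Taking only the first sorted row hides any ties, since several countries could pass the threshold in the same year. A minimal sketch that keeps every row from the earliest qualifying year, using made-up data (countries "A" and "B" and their values are assumptions):

```python
import pandas as pd

# Hypothetical data: both countries first exceed 100,000 in 1991.
toy_df = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Year": [1991, 1991, 1990, 1990],
    "GDP_Per_capita": [120000.0, 101000.0, 99000.0, 80000.0],
})

over_100k = toy_df[toy_df.GDP_Per_capita > 100000]
earliest_year = over_100k.Year.min()

# Keep every country that qualifies in the earliest year, not just one row.
first_rows = over_100k[over_100k.Year == earliest_year]
print(earliest_year, sorted(first_rows.Country))
```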

#### 11. Which country was the first to have a GDP per capita greater than $100,000?

United Arab Emirates was the first country in the data to have a GDP_Per_capita greater than $100,000. The year this happened was 1990.

In [None]:
maxgdp2020 = gdp2020_df.GDP_Per_capita.max()
maxgdp2020country = gdp2020_df[(gdp2020_df.GDP_Per_capita == maxgdp2020)]
maxgdp2020country

Using a previously created DataFrame for 2020 to find the max() GDP_Per_capita of that year and the country associated with it. 
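
An alternative to comparing against `max()` is `idxmax`, which returns the index label of the largest value so the row can be looked up directly; `idxmin` does the same for the smallest. A minimal sketch on made-up 2020 data (the countries and values are assumptions):

```python
import pandas as pd

# Hypothetical single-year frame standing in for gdp2020_df.
toy_2020 = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "GDP_Per_capita": [1500.0, 98000.0, 700.0],
})

top = toy_2020.loc[toy_2020.GDP_Per_capita.idxmax()]     # row with the largest value
bottom = toy_2020.loc[toy_2020.GDP_Per_capita.idxmin()]  # row with the smallest value

print("Highest:", top.Country, top.GDP_Per_capita)
print("Lowest:", bottom.Country, bottom.GDP_Per_capita)
```

Note that `idxmax` and `idxmin` return only the first match, so the mask approach is still the one to use if ties matter.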

In [None]:
luxembourg = gdp_df[(gdp_df.Country == 'Luxembourg')]

Above code creates a DataFrame of all 'Luxembourg' entries.

In [None]:
sns.lineplot(x=luxembourg.Year, y=luxembourg.GDP_Per_capita)

The code above generates a lineplot of Luxembourg's GDP_Per_capita by Year.

#### 12. Which country had the highest GDP per capita in 2020? Create a plot showing how this country's GDP per capita has changed over the timespan of the dataset.

Luxembourg had the highest GDP_Per_capita for the year 2020. I chose to plot this on a lineplot because it is a common and readable way to visualize how a single value changes over time.

In [None]:
mingdp2020 = gdp2020_df.GDP_Per_capita.min()
mingdp2020country = gdp2020_df[(gdp2020_df.GDP_Per_capita == mingdp2020)]
mingdp2020country

In [None]:
burundi = gdp_df[(gdp_df.Country == 'Burundi')]

Repeating the steps above for the minimum GDP_Per_capita country in 2020, creating a dataframe of all years associated with Burundi.

In [None]:
sns.lineplot(x=burundi.Year, y=burundi.GDP_Per_capita)

Again, I chose the lineplot to show how Burundi's GDP_Per_capita has changed over time.

#### 13. Which country had the lowest GDP per capita in 2020? Create a plot showing how this country's GDP per capita has changed over the timespan of the dataset.

Burundi had the lowest GDP_Per_capita for 2020.

#### **Bonus question:** Is it true in general that countries had a higher GDP per capita in 2020 than in 1990? Which countries had lower GDP per capita in 2020 than in 1990?

In [None]:
gdpcompare = gdp_df[(gdp_df.Year == 1990) | (gdp_df.Year == 2020)]
gdp1990_df = gdp_df[(gdp_df.Year == 1990)]

Creating a DataFrame for 1990 and 2020 combined to start comparisons between the two years.

Created an additional DataFrame for just the year 1990.

In [None]:
sns.lineplot(x=gdpcompare.Year, y=gdpcompare.GDP_Per_capita)

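GDP_Per_capita does trend upwards from 1990 to 2020. With several countries per year, the lineplot draws the mean across countries with a confidence band, so it shows the overall upward trend but not which individual countries rose or fell. I have not yet found a better way to graph this.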

# Revisit the second portion of the bonus question to find which countries had lower GDP_Per_capita in 2020 than in 1990.
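
One possible way to answer the second part is to reshape the data so each country has one row with a 1990 column and a 2020 column, then compare the two. This is a sketch on made-up data (countries "A", "B", "C" and their values are assumptions), not a result for the real dataset:

```python
import pandas as pd

# Hypothetical data: A rose, B fell, C has no 1990 value.
toy_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "Year": [1990, 2020, 1990, 2020, 2020],
    "GDP_Per_capita": [1000.0, 1500.0, 2000.0, 1800.0, 900.0],
})

both_years = toy_df[toy_df.Year.isin([1990, 2020])]

# One row per country, one column per year; drop countries missing either year.
wide = both_years.pivot(index="Country", columns="Year", values="GDP_Per_capita").dropna()

lower_in_2020 = wide[wide[2020] < wide[1990]].index.tolist()
share_higher = (wide[2020] > wide[1990]).mean()  # fraction of countries that rose

print("Lower in 2020 than in 1990:", lower_in_2020)
print("Share with higher GDP per capita in 2020:", share_higher)
```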

In [None]:
internet_df = pd.read_csv("../data/internet_use.csv", nrows=4495)  # forward slashes avoid invalid "\d" and "\i" escape sequences

Reading in internet_use.csv initially raised an error. After investigating the file, I found a dictionary of footnotes at the bottom, so I used nrows to import only the rows that come before the footnotes.
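
The effect of `nrows` can be seen on a small in-memory file. This is a sketch with a made-up CSV whose layout (data rows followed by a footnote table) is an assumption modeled on the description above:

```python
import io

import pandas as pd

# Hypothetical file contents: three data rows, then a footnote section.
raw = (
    "Country,Year,Value,Value Footnotes\n"
    "A,2014,50.0,\n"
    "A,2013,45.0,1\n"
    "B,2014,10.0,\n"
    "\n"
    "footnoteSeqID,Footnote\n"
    "1,Estimate\n"
)

n_data_rows = 3  # number of rows before the footnotes start
toy_internet = pd.read_csv(io.StringIO(raw), nrows=n_data_rows)

print(toy_internet.shape)
print(toy_internet["Country"].tolist())
```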

In [None]:
internet_df

In [None]:
internet_df.info()

The cells above check the data types and format of the DataFrame.

In [None]:
internet_df = internet_df.drop(columns = 'Value Footnotes')

In [None]:
internet_df.columns = ['Country', 'Year', 'Internet_Users_Pct']

Above code drops footnotes column and renames the remaining columns. 

In [None]:
internet_df.Year.value_counts()

The code above lists the number of observations per year in the DataFrame.

#### 16. Look at the number of observations in this dataset per year. What do you notice?

Internet use has been reported for more UN members each year since the beginning of the dataset with the exception of a few years where the number reported stayed the same. 

In [None]:
nonzerointernet = internet_df[(internet_df.Internet_Users_Pct > 0)]
firstnonzerointernet = nonzerointernet.sort_values(by=['Year', 'Internet_Users_Pct'])
firstnonzerointernet.head(20)

#### 17. What is the first year to have a non-zero internet users percentage value?

The code above finds all instances of Internet_Users_Pct greater than 0, then sorts by 'Year' and 'Internet_Users_Pct'. The first year with a non-zero value is 1990, when 19 countries report an 'Internet_Users_Pct' greater than 0.
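
The first year and the number of countries in it can also be computed directly instead of read off the sorted table. A minimal sketch on made-up data (the countries and percentages are assumptions):

```python
import pandas as pd

# Hypothetical data: the first non-zero value appears in 1990 for country A.
toy_internet = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [1989, 1990, 1990, 1991],
    "Internet_Users_Pct": [0.0, 0.3, 0.0, 0.1],
})

nonzero = toy_internet[toy_internet.Internet_Users_Pct > 0]
first_year = nonzero.Year.min()

# Count distinct countries with a non-zero value in that first year.
n_countries_first_year = nonzero.loc[nonzero.Year == first_year, "Country"].nunique()

print(first_year, n_countries_first_year)
```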

In [None]:
internet2014 = internet_df[(internet_df.Year == 2014)]
internet2000 = internet_df[(internet_df.Year == 2000)]

Creating DataFrames of internet use for the years 2000 and 2014.

In [None]:
sns.histplot(data=internet2014, x="Internet_Users_Pct", color="blue", label="2014")
sns.histplot(data=internet2000, x="Internet_Users_Pct", color="red", label="2000")
plt.legend()

#### 18. How does the distribution of internet users percent differ for 2000 and 2014?

The code above plots overlaid histograms of Internet_Users_Pct for the years 2000 and 2014. In 2000, the majority of reported values still fell under 20%. In 2014, the majority of reported values are over 20%.
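
The visual comparison can be backed up with a couple of summary numbers per year. A minimal sketch on made-up data (the percentages are assumptions chosen only to show the calculation):

```python
import pandas as pd

# Hypothetical data: four reported values for each of the two years.
toy_internet = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000, 2014, 2014, 2014, 2014],
    "Internet_Users_Pct": [1.0, 5.0, 12.0, 40.0, 8.0, 35.0, 60.0, 90.0],
})

# Fraction of reported values under 20% in each year.
share_under_20 = (
    toy_internet.assign(under_20=toy_internet.Internet_Users_Pct < 20)
    .groupby("Year")["under_20"]
    .mean()
)

# Median percentage in each year.
medians = toy_internet.groupby("Year")["Internet_Users_Pct"].median()

print(share_under_20)
print(medians)
```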

In [None]:
under5pct2014 = internet2014[(internet2014.Internet_Users_Pct < 5)]
under5pct2014.shape

The code above subsets the internet2014 DataFrame to only those rows with an Internet_Users_Pct below 5% and reports the shape of the new set, where 16 rows represent 16 different countries.

#### 19. For how many countries was the percentage of internet users below 5% in 2014?

There were 16 countries in the year 2014 with an Internet_Users_Pct below 5%.
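
The row count equals the country count only when each country appears once in the year. A minimal sketch that counts directly and checks that assumption, on made-up 2014 data (the countries and percentages are assumptions):

```python
import pandas as pd

# Hypothetical single-year frame standing in for internet2014.
toy_2014 = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Internet_Users_Pct": [1.2, 4.9, 5.0, 70.0],
})

under_5 = toy_2014[toy_2014.Internet_Users_Pct < 5]

n_rows_under_5 = len(under_5)                      # rows below 5%
n_countries_under_5 = under_5["Country"].nunique() # distinct countries below 5%

print(n_rows_under_5, n_countries_under_5)
```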