#### 3. Import the required packages with their customary aliases as follows:

- import pandas as pd
- import numpy as np
- import matplotlib.pyplot as plt
- import seaborn as sns

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### 4. Using the pandas read_csv() method, read the GDP dataset into your notebook as a DataFrame called gdp_df. Take a look at the first few and last few rows to familiarize yourself with what is contained in this dataset.

In [2]:
gdp_df = pd.read_csv('../data/gdp_percapita.csv')

FileNotFoundError: [Errno 2] No such file or directory: '../data/gdp_percapita.csv'

In [None]:
gdp_df.head()

In [None]:
gdp_df.tail()

#### 5. How many rows and columns are in gdp_df?

In [None]:
gdp_df.shape

Rows: 7662<br>
Columns: 4

#### What is the data type of each column?

Option 1 - Using .info()

In [None]:
gdp_df.info()

Option 2 - Using dtypes

In [None]:
gdp_df.dtypes

Column Data Types: string, integer, float, float

#### 6. Drop the Value Footnotes column

Option 1 - using .drop()

Note: I assigned this one to a variable that we won't use again. This is because I also wanted to show you .pop() a few cells down, which won't work if I've already dropped it in the original.

In [None]:
gdp_df_drop = gdp_df.drop('Value Footnotes', axis=1)
gdp_df_drop.head()

Option 2 - using .pop() 

Note: .pop() removes the column from the original df in place and returns the popped column. So you can't use it to build a new, trimmed df: if you assign the result to a variable, that variable will be just the column you popped out AND the original df will still be affected.

In [None]:
gdp_df.pop('Value Footnotes')
gdp_df.head()
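
To make the difference between the two options concrete, here is a minimal sketch on a tiny made-up DataFrame. The name `toy` and its values are placeholders for illustration only, not the real GDP data.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'Country or Area': ['A', 'B'],
    'Year': [2020, 2020],
    'Value': [1000.0, 2000.0],
    'Value Footnotes': [None, None],
})

# .drop() returns a new DataFrame and leaves the original untouched.
dropped = toy.drop(columns='Value Footnotes')
print(list(toy.columns))      # the original still has all four columns
print(list(dropped.columns))  # the copy has three

# .pop() removes the column from the original and returns it as a Series.
popped = toy.pop('Value Footnotes')
print(type(popped).__name__)
print(list(toy.columns))      # the original now has three columns
```

So if you want to keep the original intact, .drop() is the safer choice; .pop() is handy when you want the removed column for something else.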

#### and rename the remaining three to 'Country', 'Year', and 'GDP_Per_Capita'.

Option 1 - Using rename with a dictionary; good for when there are a bunch of columns you don't want to change the names of, since you don't have to list out each column<br>
Again, I am assigning this one to a new variable so I can show you how to do it a different way in option 2

In [None]:
gdp_df_rename = gdp_df.rename(columns={'Country or Area': 'Country', 'Value': 'GDP_Per_Capita'})
gdp_df_rename.head()

Option 2 - Assigning to .columns (an attribute, not a method); good for when you want to change the name of most of your columns and don't want to spend the time creating a dictionary. This method is a bit dangerous because names are assigned by position: if you list them in the wrong order, the columns are silently mislabeled, and if you list too few or too many, pandas raises a ValueError.

In [None]:
gdp_df.columns = ['Country', 'Year', 'GDP_Per_Capita']
gdp_df.head()
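
Here is a small sketch of why the positional approach needs more care than rename. The DataFrame `toy` and the variables `swapped`, `short`, and `raised` are made up for this illustration.

```python
import pandas as pd

# Toy one-row frame with the original column names (values are made up).
toy = pd.DataFrame({'Country or Area': ['A'], 'Year': [2020], 'Value': [1000.0]})

# rename() matches on the old name, so the order of the dictionary does not matter.
renamed = toy.rename(columns={'Value': 'GDP_Per_Capita', 'Country or Area': 'Country'})
print(list(renamed.columns))

# Assigning to .columns is positional: a wrong order mislabels columns silently.
swapped = toy.copy()
swapped.columns = ['Year', 'Country', 'GDP_Per_Capita']
print(swapped.loc[0, 'Year'])  # this is actually the country name

# A list of the wrong length raises a ValueError instead of guessing.
short = toy.copy()
try:
    short.columns = ['Country', 'Year']
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```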

#### 7. How many countries have data for all years? Which countries are missing many years of data? Look at the number of observations per year. What do you notice?

In [None]:
# Count of the total years in the dataset
gdp_df['Year'].nunique()

In [None]:
# Number of years represented for each country
gdp_df_year_counts = gdp_df.groupby(['Country']).count().sort_values(by = 'Year')
gdp_df_year_counts

The above code looks complicated, but here is the roughly equivalent SQL query:<br><br>
    SELECT Country, COUNT(Year), COUNT(GDP_Per_Capita)<br>
    FROM gdp_df<br>
    GROUP BY Country<br>
    ORDER BY COUNT(Year)<br><br>
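
If it helps to see the groupby step by step, here is the same pattern on a tiny made-up table (the name `toy` and its values are placeholders, not the real data).

```python
import pandas as pd

# Country A has 3 years of data, B has 2, C has 1 (values are made up).
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Year': [2018, 2019, 2020, 2019, 2020, 2020],
    'GDP_Per_Capita': [1.0, 1.1, 1.2, 5.0, 5.5, 9.0],
})

# count() tallies the non-null values in each remaining column, per country.
# sort_values(by='Year') then orders the countries by that count, fewest first.
year_counts = toy.groupby('Country').count().sort_values(by='Year')
print(year_counts)
```

Note that after the groupby, the 'Year' column no longer holds years; it holds the number of rows with a non-null year for each country.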

In [None]:
# Only the countries that have data for all years
gdp_df_year_counts[gdp_df_year_counts['Year'] == 33]

In [None]:
# Only the countries that do not have data for all years
gdp_df_year_counts[gdp_df_year_counts['Year'] < 33]

In [None]:
# Number of years (left) and the corresponding number of countries that have that many years of data (right)
gdp_df_year_counts['Year'].value_counts().sort_index()

Number of countries with data for every year: 202<br>
Number of countries missing "many" years of data: 15ish?<br>

#### 8. In this question, you're going to create some plots to show the distribution of GDP per capita for the year 2020. Go to the Python Graph Gallery (https://www.python-graph-gallery.com/) and look at the different types of plots under the Distribution section. Create a histogram, a density plot, a boxplot, and a violin plot. What do you notice when you look at these plots? How do the plots compare and what information can you get out of one type that you can't necessarily get out of the others?

- Histogram:

In [None]:
gdp_2020 = gdp_df[gdp_df['Year'] == 2020]

In [None]:
# I added some extra arguments for fine tuning, but the first argument is the only one necessary to create the plot
plt.hist(gdp_2020['GDP_Per_Capita'], bins = 20, edgecolor='black');

In [None]:
# This is the way the python graph gallery shows this being done
# By assigning fig and ax, you can have control over the 'fig' - in this case, we are changing the size
# subplots have many other uses, this is the simplest
fig, ax = plt.subplots(figsize = (12, 7))
ax.hist(gdp_2020['GDP_Per_Capita'], bins = 20, edgecolor='black');

- Density plot:

In [None]:
# Again, the first argument is the only one that's mandatory
sns.kdeplot(gdp_2020['GDP_Per_Capita'], fill = True, color = 'purple');

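The y-axis here is probability density, not probability. The area under the curve over a range of x-values is the probability that a value falls in that range, which is why the y-values can be tiny numbers. It's a little mathy, but for the purposes of this exercise, just know that higher means that values in that range happen more often.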
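
One way to convince yourself of the "area, not height" idea is with a density-normalized histogram, which follows the same rule as a density plot. This is only a sketch: the data below is randomly generated (not the GDP data), and a histogram is standing in for the smooth KDE curve.

```python
import numpy as np

# Made-up, right-skewed values, loosely shaped like an income distribution.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=9.0, sigma=1.0, size=500)

# density=True rescales bar heights so that the total AREA of the bars is 1.
heights, edges = np.histogram(values, bins=20, density=True)
widths = np.diff(edges)
total_area = (heights * widths).sum()

print(heights.max())  # a very small number, because the bins are very wide
print(total_area)     # approximately 1
```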

- Box plot:

In [None]:
sns.boxplot(y=gdp_2020['GDP_Per_Capita']);

In [None]:
# I like this one because it looks like a cartoon (not a good reason to pick a visual)

sns.boxplot(y=gdp_2020['GDP_Per_Capita'], linewidth=5);

- Violin plot:

In [None]:
sns.violinplot(y=gdp_2020['GDP_Per_Capita']);

#### 9. What was the median GDP per capita value in 2020?

In [None]:
gdp_2020 = gdp_df[gdp_df['Year'] == 2020]
gdp_2020

In [None]:
gdp_2020['GDP_Per_Capita'].median()

Median GDP per cap in 2020: $13,358.00

#### 10. For this question, you're going to create some visualizations to compare GDP per capita values for the years 1990, 2000, 2010, and 2020. Start by subsetting your data to just these 4 years into a new DataFrame named gdp_decades.

In [None]:
gdp_decades = gdp_df[gdp_df['Year'].isin([1990,2000,2010,2020])]
gdp_decades

Using this, create the following 4 plots:

- A boxplot

In [None]:
sns.boxplot(x=gdp_decades['Year'], y=gdp_decades['GDP_Per_Capita']);

- A barplot

In [None]:
sns.barplot(x=gdp_decades['Year'], y=gdp_decades['GDP_Per_Capita']);

- A scatterplot

In [None]:
sns.scatterplot(x=gdp_decades['Year'], y=gdp_decades['GDP_Per_Capita']);

- A scatter with trend line

In [None]:
sns.regplot(x=gdp_decades['Year'], y=gdp_decades['GDP_Per_Capita']);

#### 11. Which country was the first to have a GDP per capita greater than $100,000?

In [None]:
gdp_overtime = gdp_df.sort_values('Year')
gdp_overtime

In [None]:
gdp_overtime[gdp_overtime['GDP_Per_Capita'] > 100000].reset_index(drop=True)

In [None]:
# Save the filtered rows; since they are sorted by year, row 0 is the earliest
gdp_100k = gdp_overtime[gdp_overtime['GDP_Per_Capita'] > 100000].reset_index(drop=True)
gdp_100k.loc[0]

First country to hit $100k was United Arab Emirates in 1990

#### 12. Which country had the highest GDP per capita in 2020?

Using max:

In [None]:
gdp_2020['GDP_Per_Capita'].max()

In [None]:
max_gdp_2020 = gdp_2020['GDP_Per_Capita'].max()

In [None]:
gdp_2020[gdp_2020['GDP_Per_Capita'] == max_gdp_2020]

Using .idxmax:

In [None]:
gdp_2020.loc[gdp_2020['GDP_Per_Capita'].idxmax()]

Using nlargest:

In [None]:
gdp_df[gdp_df['Year'] == 2020].nlargest(1, 'GDP_Per_Capita')
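
The three approaches find the same country but return slightly different things. Here is a sketch on a made-up three-row frame (`toy_2020` and its values are placeholders for illustration).

```python
import pandas as pd

# Toy 2020 subset with a non-default index, as you would get after filtering.
toy_2020 = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Year': [2020, 2020, 2020],
    'GDP_Per_Capita': [1000.0, 9000.0, 4000.0],
}, index=[10, 11, 12])

# Boolean mask against the max: a DataFrame, and it keeps every tied row.
via_max = toy_2020[toy_2020['GDP_Per_Capita'] == toy_2020['GDP_Per_Capita'].max()]

# idxmax gives the index label of the max, so .loc returns one row as a Series.
via_idxmax = toy_2020.loc[toy_2020['GDP_Per_Capita'].idxmax()]

# nlargest returns a DataFrame with the top n rows.
via_nlargest = toy_2020.nlargest(1, 'GDP_Per_Capita')

print(type(via_max).__name__, type(via_idxmax).__name__, type(via_nlargest).__name__)
```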

#### Create a plot showing how this country's GDP per capita has changed over the timespan of the dataset.

In [None]:
gdp_df[gdp_df['Year'] == 2020].nlargest(1, 'GDP_Per_Capita')['Country'].tolist()[0]

In [None]:
top_country = gdp_df[gdp_df['Year'] == 2020].nlargest(1, 'GDP_Per_Capita')['Country'].tolist()[0]

gdp_df[gdp_df['Country'] == top_country].plot(x = 'Year', y = 'GDP_Per_Capita', kind = 'line');

#### 13. Which country had the lowest GDP per capita in 2020? Create a plot showing how this country's GDP per capita has changed over the timespan of the dataset.

In [None]:
gdp_df[gdp_df['Year'] == 2020].nsmallest(1, 'GDP_Per_Capita')['Country'].iloc[0]

In [None]:
gdp_df_2020_nsmallest = (
    gdp_df[gdp_df['Year'] == 2020]
    .nsmallest(1, 'GDP_Per_Capita')['Country']
    .iloc[0]
)

selected_country_df = gdp_df.loc[gdp_df['Country'] == gdp_df_2020_nsmallest]

# Plot the change in GDP per capita over time for the selected country
plt.plot(selected_country_df['Year'], selected_country_df['GDP_Per_Capita'])
plt.title(f"GDP per capita for {gdp_df_2020_nsmallest}")
plt.xlabel("Year")
plt.ylabel("GDP per capita ($)")
plt.ylim(ymin=0)

#### Bonus question: Is it true in general that countries had a higher GDP per capita in 2020 than in 1990? Which countries had lower GDP per capita in 2020 than in 1990?

In [None]:
gdp_df[gdp_df['Year'] == 1990]

In [None]:
gdp_df[gdp_df['Year'] == 2020]

In [None]:
pd.merge(
    left = gdp_df[gdp_df['Year'] == 1990],
    right = gdp_df[gdp_df['Year'] == 2020],
    on = 'Country',
    suffixes = ['_1990', '_2020']
)

In [None]:
# creating a subset of just each year and merging them together with new endings for the columns that will have the same name from both tables
gdp_comparison = pd.merge(
    left = gdp_df[gdp_df['Year'] == 1990],
    right = gdp_df[gdp_df['Year'] == 2020],
    on = 'Country',
    suffixes = ['_1990', '_2020']
)

# creating a new column to indicate if 2020 was lower than 1990
gdp_comparison['2020_lower'] = gdp_comparison['GDP_Per_Capita_2020'] < gdp_comparison['GDP_Per_Capita_1990']
gdp_comparison

In [None]:
gdp_comparison['2020_lower'].value_counts()

In [None]:
gdp_comparison[gdp_comparison['2020_lower']]
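
Here is the same merge-and-compare pattern on a tiny made-up table, so you can see what the suffixes do and which rows survive. The frame `toy` is a placeholder; country C deliberately has no 1990 row.

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C'],
    'Year': [1990, 2020, 1990, 2020, 2020],
    'GDP_Per_Capita': [100.0, 150.0, 300.0, 250.0, 500.0],
})

# The default merge is an inner join, so only countries present in BOTH years
# are kept. Columns that share a name on both sides get the suffixes appended.
comparison = pd.merge(
    left=toy[toy['Year'] == 1990],
    right=toy[toy['Year'] == 2020],
    on='Country',
    suffixes=('_1990', '_2020'),
)

comparison['2020_lower'] = (
    comparison['GDP_Per_Capita_2020'] < comparison['GDP_Per_Capita_1990']
)
lower = comparison.loc[comparison['2020_lower'], 'Country'].tolist()
print(list(comparison.columns))
print(lower)
```

Keep in mind that countries missing either year drop out of the comparison entirely, so the counts only describe countries with data in both years.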

#### 14. Read in the internet use dataset into a DataFrame named internet_df. You will likely get errors when doing this. Check the arguments for the read_csv function to find ones that can help correct the errors (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

In [None]:
# nrows tells it to stop reading at the row right before things start getting weird
internet_df = pd.read_csv('../data/internet_use.csv', nrows = 6083)
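
Here is a sketch of what nrows is doing, using an in-memory file. The text in `raw` is made up and only imitates a data file with footnote rows at the bottom; the real file's layout may differ.

```python
import io
import pandas as pd

# A made-up file: three real data rows followed by footnote rows.
raw = """Country or Area,Year,Value,Value Footnotes
Afghanistan,2014,6.39,
Albania,2014,60.1,
Algeria,2014,18.09,
fnSeqID,Footnote,,
1,Estimated value.,,
"""

# Reading everything pulls the footnote rows in as if they were data,
# which forces the Year column to be read as text.
full = pd.read_csv(io.StringIO(raw))
print(full['Year'].dtype)

# nrows stops after the given number of data rows (the header is not counted).
clean = pd.read_csv(io.StringIO(raw), nrows=3)
print(clean['Year'].dtype)
```

Checking the dtypes like this is a quick way to spot stray non-data rows: a numeric column that comes back as text usually means something else got read in.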

#### Once you are able to read it in, take a look at the top and bottom few rows to make sure that it has been read in correctly.

In [None]:
internet_df.head()

In [None]:
internet_df.tail()

#### Also, check the datatypes of the columns.

In [None]:
internet_df.dtypes

#### 15. Drop the Value Footnotes column and rename the remaining three to 'Country', 'Year', and 'Internet_Users_Pct'.

In [None]:
internet_df = internet_df.drop(columns = 'Value Footnotes')
internet_df.columns = ['Country', 'Year', 'Internet_Users_Pct']

#### 16. Look at the number of observations in this dataset per year. What do you notice?

In [None]:
internet_df.groupby('Year')['Country'].count().sort_index()

#### 17. What is the first year to have a non-zero internet users percentage value?

In [None]:
internet_df[internet_df['Internet_Users_Pct'] > 0].sort_values('Year')

In [None]:
greater_than_0 = internet_df[internet_df['Internet_Users_Pct'] > 0].sort_values('Year')
greater_than_0

#### 18. How does the distribution of internet users percent differ for 2000 and 2014?

In [None]:
data_2000 = internet_df[internet_df['Year'] == 2000]
data_2014 = internet_df[internet_df['Year'] == 2014]

plt.figure(figsize=(10, 6))

plt.hist(data_2000['Internet_Users_Pct'], alpha=0.5, bins=20, label='2000', color='purple', edgecolor='black')
plt.hist(data_2014['Internet_Users_Pct'], alpha=0.5, bins=20, label='2014', color='blue', edgecolor='black')

plt.xlabel("Internet Users Percentage")
plt.ylabel("Frequency")
plt.title("Distribution of Internet Users Percentage for 2000 and 2014")

plt.legend();

#### 19. For how many countries was the percentage of internet users below 5% in 2014?

In [None]:
internet_df.loc[(internet_df['Year'] == 2014) & (internet_df['Internet_Users_Pct'] < 5)]

In [None]:
below_5_2014 = internet_df.loc[(internet_df['Year'] == 2014) & (internet_df['Internet_Users_Pct'] < 5)]
below_5_2014

In [None]:
below_5_2014['Country'].count()

#### 20. Merge the two DataFrames to one. Do this in a way that keeps all rows from each of the two DataFrames. Call the new DataFrame gdp_and_internet_use.

In [None]:
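# Being explicit about the keys; without `on`, merge uses all shared column names
gdp_and_internet_use = gdp_df.merge(internet_df, on=['Country', 'Year'], how='outer')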
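
To see what an outer merge keeps, here is a sketch on two tiny made-up frames (`gdp_toy` and `internet_toy` are placeholders). The `indicator=True` argument is an optional extra I added here; it creates a `_merge` column showing where each row came from.

```python
import pandas as pd

gdp_toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'Year': [2014, 2014],
    'GDP_Per_Capita': [1000.0, 2000.0],
})
internet_toy = pd.DataFrame({
    'Country': ['B', 'C'],
    'Year': [2014, 2014],
    'Internet_Users_Pct': [50.0, 80.0],
})

# how='outer' keeps rows from both sides; missing values are filled with NaN.
merged = gdp_toy.merge(internet_toy, on=['Country', 'Year'], how='outer', indicator=True)
print(merged)
```

Country A has no internet value and country C has no GDP value, but both are still in the result.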

#### Look at the first and last few rows to confirm that it merged correctly.

In [None]:
gdp_and_internet_use

#### 21. Find the three countries with the highest internet users percentage in 2014.

In [None]:
gdp_and_internet_use[gdp_and_internet_use['Year'] == 2014].nlargest(3, 'Internet_Users_Pct')['Country'].to_list()

In [None]:
highest_int_2014 = (
    gdp_and_internet_use[gdp_and_internet_use['Year'] == 2014]
    .nlargest(3, 'Internet_Users_Pct')['Country']
    .to_list()
)

highest_int_2014

#### Use a seaborn FacetGrid (https://seaborn.pydata.org/generated/seaborn.FacetGrid.html) to compare how the GDP per capita has changed over time for these three countries. What do you notice?

In [None]:
top_3 = gdp_and_internet_use[gdp_and_internet_use['Country'].isin(highest_int_2014)]

g = sns.FacetGrid(top_3, col="Country", hue="Country", height=5, aspect=1.2)
g.map(sns.lineplot, "Year", "GDP_Per_Capita")

g.set_axis_labels("Year", "GDP per Capita")
g.set_titles(col_template="{col_name}")
g.set(ylim=(0, None));

#### 22. Subset gdp_and_internet_use to just the year 2014. Save this as a new dataframe named gdp_and_internet_use_2014.

In [None]:
gdp_and_internet_use_2014 = gdp_and_internet_use[gdp_and_internet_use['Year'] == 2014]

#### 23. Create a plot which compares Internet Users Percentage and GDP per Capita for the year 2014. What do you notice from this plot? If you see any unusual points, investigate them.

In [None]:
gdp_and_internet_use[gdp_and_internet_use['Year'] == 2014].plot(kind = 'scatter',
                                                               x = 'GDP_Per_Capita',
                                                               y = 'Internet_Users_Pct');

In [None]:
gdp_and_internet_use.loc[
    (gdp_and_internet_use['Year'] == 2014) &
    (gdp_and_internet_use['GDP_Per_Capita'] > 25000) &
    (gdp_and_internet_use['Internet_Users_Pct'] < 20)
]

In [None]:
gdp_and_internet_use.loc[
    (gdp_and_internet_use['Year'] == 2014) &
    (gdp_and_internet_use['GDP_Per_Capita'] > 110000) &
    (gdp_and_internet_use['Internet_Users_Pct'] < 96)
]

#### 24. Stretch Question: Use the qcut function from pandas (https://pandas.pydata.org/docs/reference/api/pandas.qcut.html) to divide countries in gdp_and_internet_use_2014 into three groups based on their GDP per capita values. Label these groups as "Low", "Medium", and "High". Put these labels in a new column, named "GDP_group".

In [None]:
gdp_and_internet_use_2014 = gdp_and_internet_use[gdp_and_internet_use['Year'] == 2014].copy()

gdp_and_internet_use_2014['GDP_group'] = pd.qcut(gdp_and_internet_use_2014['GDP_Per_Capita'],
        q = 3, labels = ['Low', 'Medium', 'High'])

gdp_and_internet_use_2014
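
Here is qcut on a small made-up column so you can see how the groups come out. The frame `toy` and its nine values are placeholders chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': list('ABCDEFGHI'),
    'GDP_Per_Capita': [500.0, 900.0, 1500.0, 4000.0, 7000.0,
                       12000.0, 30000.0, 45000.0, 90000.0],
})

# qcut splits on quantiles, so each group gets (roughly) the same NUMBER of
# rows, even though the GDP ranges covered by each group are very different.
toy['GDP_group'] = pd.qcut(toy['GDP_Per_Capita'], q=3, labels=['Low', 'Medium', 'High'])
print(toy)
```

This is the difference from pd.cut, which splits the value range into equal-width bins and would put most of these rows in the lowest bin.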

#### 25. Stretch Question: How does the median internet users percentage compare for the three gdp groups?

In [None]:
gdp_and_internet_use_2014.groupby('GDP_group')['Internet_Users_Pct'].median()