# Lab 1 - Python, Pandas and Matplotlib
- **Author:** Suraj R. Nair ([suraj.nair@berkeley.edu](mailto:suraj.nair@berkeley.edu)) (Adapted from labs by Emily Aiken, Qutub Khan Vajihi and Dimitris Papadimitriou)
- **Date:** Feb 17, 2024
- **Course:** INFO 251: Applied Machine Learning

**Bonus**: Seaborn also has beautiful built-in plots. If there is time, try experimenting with any of the following plots from seaborn using the gap_df data: [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html), [violinplot](https://seaborn.pydata.org/generated/seaborn.violinplot.html), or [kernel density estimate](https://seaborn.pydata.org/generated/seaborn.kdeplot.html). 

## 5. Bonus: Some Pandas Exercise Questions

#### (Adapted from An Introduction to Statistical Learning, James et al. (2013))


Using the 'gapminder.csv' dataset from earlier in the lab, try to answer the questions below.

In [12]:
import pandas as pd


## Load DataFrame
df_gap = pd.read_csv("gapminder.csv")

a) Which variables are quantitative and which are qualitative?

In [13]:
df_gap.dtypes ## Use dtypes to examine the variable types

country        object
year            int64
population      int64
continent      object
life_exp      float64
gdp_cap       float64
dtype: object

Write your answer below.

Answer:

- country and continent are qualitative
- year, population, life_exp and gdp_cap are quantitative
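The same split can be read off programmatically with `select_dtypes`. The sketch below uses a toy frame (`toy`, with made-up values) that mirrors the gapminder column names. Treat the result as a first pass only: the dtype says how a column is stored, not how it should be interpreted. For example, `year` is stored as an integer but could reasonably be treated as an ordered category.

```python
import pandas as pd

# Toy frame with the same column names as gapminder.csv; the values are made up
toy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [1952, 1957],
    "population": [1_000_000, 2_500_000],
    "continent": ["X", "Y"],
    "life_exp": [40.5, 62.1],
    "gdp_cap": [800.0, 5200.0],
})

# Numeric dtypes are a rough proxy for "quantitative"; everything else for "qualitative"
quantitative = toy.select_dtypes(include="number").columns.tolist()
qualitative = toy.select_dtypes(exclude="number").columns.tolist()

print(quantitative)  # ['year', 'population', 'life_exp', 'gdp_cap']
print(qualitative)   # ['country', 'continent']
```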

b) What is the *range* of **life_exp**?

In [19]:
print(f"Life expectancy ranges from {df_gap['life_exp'].min()} to {df_gap['life_exp'].max()}.")

Life expectancy ranges from 23.599 to 82.603.
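If "range" is read as a single number (maximum minus minimum) rather than a pair of endpoints, it can be computed directly. In the sketch below, the two extremes are taken from the output above; the middle values are made up for illustration.

```python
import pandas as pd

# The extremes match the output above; the two middle values are made up
life_exp = pd.Series([23.599, 45.0, 67.3, 82.603], name="life_exp")

# Range as a single number: largest value minus smallest value
spread = life_exp.max() - life_exp.min()
print(f"Range: {spread:.3f} years")  # Range: 59.004 years
```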


d) What are the mean and standard deviation of **population** and **gdp_cap**?

In [17]:
df_gap[['population', 'gdp_cap']].describe().loc[['mean', 'std']].round(0) ## round to 0 decimal places for readability

       population  gdp_cap
mean   29601212.0   7156.0
std   106157897.0   9535.0
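An alternative to slicing `describe()` is to request only the statistics of interest with `agg`. The sketch below uses a toy frame with made-up values. It also shows that pandas computes the sample standard deviation by default (`ddof=1`), which differs from the population standard deviation (`ddof=0`).

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the lab
toy = pd.DataFrame({
    "population": [1_000_000, 2_000_000, 6_000_000],
    "gdp_cap": [500.0, 1500.0, 4000.0],
})

# Two routes to the same summary table
via_describe = toy.describe().loc[["mean", "std"]]
via_agg = toy.agg(["mean", "std"])

# pandas defaults to the sample standard deviation (divides by n - 1)
sample_std = toy["population"].std()
# Pass ddof=0 for the population standard deviation (divides by n)
population_std = toy["population"].std(ddof=0)

print(via_agg.round(0))
print(sample_std > population_std)  # True
```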


e) Now remove observations from the continent "Oceania" and, for the remaining data, report the min, max, mean, and standard deviation of **life_exp**.

In [22]:
df_gap_subset = df_gap[df_gap['continent'] != "Oceania"].copy() # Remove Oceania
df_gap_subset[['life_exp']].describe().loc[['min', 'max', 'mean', 'std']] 

       life_exp
min      23.599
max      82.603
mean  59.262271
std    12.87794
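The filter-then-summarize pattern can be checked on a small example. The sketch below uses a toy frame with made-up values. It shows the boolean-mask filter next to an equivalent `query` call, and confirms that no "Oceania" rows survive before summarizing.

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the lab
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Asia", "Oceania", "Europe", "Oceania"],
    "life_exp": [61.0, 75.5, 78.2, 80.1],
})

# Boolean mask: keep rows whose continent is not Oceania
subset = toy[toy["continent"] != "Oceania"].copy()

# Equivalent filter written as a query string
subset_query = toy.query("continent != 'Oceania'")

# Sanity check before summarizing: the filter removed every Oceania row
print("Oceania" in subset["continent"].values)  # False

summary = subset["life_exp"].agg(["min", "max", "mean", "std"])
print(summary)
```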


f) For each year in the dataset, identify the country with the maximum GDP per capita.

In [32]:
# First, get the max GDP per capita for each year.
# There are many ways to do this.
# Here, we use groupby and then transform, storing the yearly max of gdp_cap in a separate column.
# We then select the rows where gdp_cap equals that yearly max, and sort by year.

df_gap['max_gdp'] = df_gap.groupby('year')['gdp_cap'].transform('max') 
df_gap[df_gap['gdp_cap'] == df_gap['max_gdp']][['year', 'country', 'max_gdp']].sort_values('year')

      year       country      max_gdp
1476  1952   Switzerland  14734.23275
853   1957        Kuwait  113523.1329
854   1962        Kuwait  95458.11176
855   1967        Kuwait  80894.88326
856   1972        Kuwait   109347.867
857   1977        Kuwait  59265.47714
1314  1982  Saudi Arabia  33693.17525
1147  1987        Norway   31540.9748
860   1992        Kuwait  34932.91959
1149  1997        Norway  41283.16433
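To see what `transform('max')` does, it helps to run it on a tiny example. In the sketch below (a toy frame with made-up values), the group maximum is broadcast back to every row of its group, so the result has the same length as the original frame and can be compared row by row. This variant keeps the yearly maximum in a standalone Series (`yearly_max`, a name chosen here for illustration) instead of adding a column to the frame.

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the lab
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "year": [1952, 1952, 1957, 1957],
    "gdp_cap": [100.0, 300.0, 250.0, 200.0],
})

# transform returns one value per original row: the max of that row's year
yearly_max = toy.groupby("year")["gdp_cap"].transform("max")
print(yearly_max.tolist())  # [300.0, 300.0, 250.0, 250.0]

# Keep only the rows that attain their year's maximum
winners = (
    toy[toy["gdp_cap"] == yearly_max][["year", "country", "gdp_cap"]]
    .sort_values("year")
)
print(winners)
```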


In [34]:
# Same result (the top country per year), in a single line of code and without creating a separate column for the max
# We set country as the index, then use idxmax to obtain the index label (here, the country) of the largest gdp_cap in each year
df_gap.set_index('country').groupby('year')['gdp_cap'].idxmax()

year
1952     Switzerland
1957          Kuwait
1962          Kuwait
1967          Kuwait
1972          Kuwait
1977          Kuwait
1982    Saudi Arabia
1987          Norway
1992          Kuwait
1997          Norway
2002          Norway
2007          Norway
Name: gdp_cap, dtype: object
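The two approaches differ when several countries tie for the maximum in a year: `idxmax` returns only the first label it encounters, while an equality mask against the yearly maximum keeps every tied row. The sketch below illustrates this with a toy frame whose values are made up and deliberately tied.

```python
import pandas as pd

# Toy data with a deliberate tie between A and B; the values are made up
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [1952, 1952, 1952],
    "gdp_cap": [300.0, 300.0, 100.0],
})

# idxmax reports a single label per year: the first one that attains the max
first_only = toy.set_index("country").groupby("year")["gdp_cap"].idxmax()
print(first_only.loc[1952])  # A

# The equality mask keeps every row that attains the yearly max
yearly_max = toy.groupby("year")["gdp_cap"].transform("max")
all_tied = toy[toy["gdp_cap"] == yearly_max]
print(all_tied["country"].tolist())  # ['A', 'B']
```

With real-valued GDP figures, exact ties are unlikely, so the two methods would normally agree.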