# Lab 1 - Python, Pandas and Matplotlib
- **Author:** Suraj R. Nair ([suraj.nair@berkeley.edu](mailto:suraj.nair@berkeley.edu)) (Adapted from labs by Emily Aiken, Qutub Khan Vajihi and Dimitris Papadimitriou)
- **Date:** Feb 17, 2024
- **Course:** INFO 251: Applied Machine Learning

### Learning Objectives:

* Know what is good style when writing Python code
* Learn some useful Python features that you may not already know about
* Work with DataFrames using the Pandas library
* Produce basic graphs using the Matplotlib library, and learn some tips to produce readable and beautiful graphs

### Feedback:

After the lab, please provide feedback via this anonymous [Google Form](https://forms.gle/ZM8GaTf5Ejo544zY9).


## 1. Python Code Style
Below are some key points for Python coding style. Most importantly, remember that code is for people to read --- and in this class, for people to grade --- so use your best judgement to make your code readable. 

For more information, see the Python style guide co-authored by Guido van Rossum: [PEP 8 -- Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/).

*Agenda: Line length, variable names, strings, whitespace, blank lines, comments, imports*

* **Line length**:
    PEP 8 limits lines to 79 characters. As a rule of thumb in Jupyter notebooks, don't let a line run past the width of the cell on a laptop screen. To break a very long line, wrap the expression in parentheses (PEP 8's preferred approach) or end the line with a backslash.
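
As a minimal sketch of the two ways to break a long line (the variable names here are made up for illustration):

```python
# Hypothetical values, used only to build a long expression
first_term = 10
second_term = 20
third_term = 30

# Implicit continuation: wrap the expression in parentheses (preferred by PEP 8)
total = (first_term
         + second_term
         + third_term)

# Explicit continuation: a backslash at the end of each line also works,
# but it breaks if any character (even a space) follows the backslash
total_backslash = first_term \
    + second_term \
    + third_term

print(total, total_backslash)
```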
    




* **Variable names:** Make variable names (nouns) and function names (verbs) descriptive.

In [None]:
# Correct
hyperparameter_grid = {1, 2, 3}
no_of_iterations = 50

# Incorrect
a = 12
var = 10

* **Strings:** Be consistent between ' and ".

* **Whitespace:** 
Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <=, >=, in, not in, is, is not), and Booleans (and, or, not). Avoid extraneous whitespace immediately inside parentheses, brackets, or braces.

In [None]:
# Correct:
i = 0
i = i + 1
i += 1
lst = [0, 1, 2]
tple = (0, 1, 2)
st = {0, 1, 2}
print(lst[0])

# Incorrect:
i=0
i+=1
lst = [ 0, 1, 2 ]
tple = ( 0, 1, 2 )
st = { 0, 1, 2 }
print( lst[ 0 ] )

* **Blank lines:**
Surround top-level function and class definitions with two blank lines, including after the import block.

In [None]:
import math


def foo(x):
    if x >= 0:
        return math.sqrt(x)
    else:
        return None


def bar(x):
    if x < 0:
        return None
    return math.sqrt(x)

* **Comments:**
    For readability, explain what your code does with comments. Comments can precede a block of code, or sit inline on a single line of code.
    
* **Docstrings:**
    Docstrings are used to document a Python module, class, function or method. For consistency, always use """triple double quotes""" around docstrings. 

In [None]:
import math

# Creating a dictionary and inverting it
my_map = {'INFO 251': 0, 'Lab': 1}
inv_map = {v: k for k, v in my_map.items()}  # Inverting the dict

# Printing and returning the dictionary
print(inv_map)
inv_map

# Inline comments
y = 94720
y = y + 1  # Increment y. This is an inline comment. Use sparingly.


# Docstrings
def foo(x):
    """Given a non-negative number, return its square root."""
    if x >= 0:
        sqrt = math.sqrt(x)
        return sqrt
    else:
        return None


def foo(x):
    """Return the square root of a non-negative number.

    Parameters
    ----------
    x : numeric
        If non-negative, return the square root

    Returns
    -------
    sqrt : numeric

    *Note that in a multiline docstring, the closing quotes are on a separate line
    """
    if x >= 0:
        return math.sqrt(x)
    else:
        return None





* **Imports** - Names imported from the same module/package can share a line. Imports of different modules/packages should be on separate lines.

In [None]:
# Correct:
import pandas
import matplotlib

# Wrong:
import pandas, matplotlib

# Correct:
from sklearn.metrics import r2_score, roc_auc_score

## 2. Some Useful Python Features
*Agenda: Reading/writing files, file paths, enumerate, lambda functions, zip*

* **Reading and writing files:** Use "with" to open the files, which will make sure the files are closed automatically

In [None]:
# Use "with" to open files...
with open('test.txt', 'r') as f:
    for line in f:
        print(line)

In [None]:
# ...otherwise you explicitly need to 'open' and 'close' files.
f = open('test.txt', 'r')
for line in f:
    print(line)

f.close()
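
The cells above only read a file. As a rough sketch of writing with the same "with" pattern, the example below writes two lines and reads them back. The file name `example.txt` and the use of a temporary directory are assumptions made so the example cleans up after itself.

```python
import os
import tempfile

# Work inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'example.txt')  # Hypothetical file name

    # Mode 'w' creates the file, or overwrites it if it already exists
    with open(file_path, 'w') as f:
        f.write('first line\n')
        f.write('second line\n')

    # Read the file back, stripping the trailing newline from each line
    with open(file_path, 'r') as f:
        lines = [line.rstrip('\n') for line in f]

print(lines)
```

Use mode `'a'` instead of `'w'` to append to an existing file rather than overwrite it.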

* **File paths:** Concatenate path parts with **os.path.join** rather than with string concatenation

In [None]:
import os

# Correct
country_name = 'USA'
month = 'January'
path = os.path.join('a', 'b', country_name, month)
print(path)

# Less correct
path = 'a/b/' + country_name + '/' + month
print(path)

* **Enumerate**: great for getting the index and the element of an iterable at the same time.

In [None]:
# Use enumerate to get the index (which comes first) and the element (which comes second) at the same time...
for i, x in enumerate([1, 2, 3]):
    print('Index:', i)
    print('Element:', x)

In [None]:
# ...otherwise you'll have to maintain a counter variable by hand, which isn't very elegant
counter = 0
for x in [1, 2, 3]:
    print('Index:', counter)
    print('Element:', x)
    counter += 1

* **Lambda functions**: A lambda function is a small anonymous function, that is, a function defined without a name. Lambda functions are used a lot with pandas.

In [None]:
# Lambda function with one variable
x = lambda a: a * 3 + 3
print(x(3))  # prints '12'

# The equivalent named function (PEP 8 prefers def when a function is bound to a name)
def my_function(a):
    return a * 3 + 3
print(my_function(3))  # also prints '12'

In [None]:
# Lambda function with two variables
x = lambda a, b: a * b
print(x(5, 6))  # prints '30'
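
Lambdas tend to be most useful as short throwaway arguments to other functions. A small sketch using the built-in `sorted` and `filter` (the item list is made up for illustration):

```python
# Hypothetical (product, price) pairs
items = [('bread', 4.99), ('eggs', 6.0), ('gas', 5.60)]

# Sort by price: the lambda extracts the sort key from each pair
by_price = sorted(items, key=lambda item: item[1])
print(by_price)

# Keep only the items that cost less than 6
cheap = list(filter(lambda item: item[1] < 6, items))
print(cheap)
```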

* **Zipping**: The zip() function returns a zip object: an iterator of tuples in which the first items of the passed iterables are paired together, then the second items, and so on. If the iterables have different lengths, the shortest one determines the length of the result (https://www.w3schools.com/python/ref_func_zip.asp). It's a great way of pairing together two lists.

In [None]:
# Example: zipping two lists of the same length
products = ['bread', 'eggs', 'gas']
prices = [4.99, 6, 5.60]

for product, price in zip(products, prices):
    print('Product: {}, Price: {}'.format(product, price))

In [None]:
# Example: zipping two lists of different lengths -- the result has the length of the shorter list
products = ['table', 'chair', 'sofa', 'bed']
prices = [50, 20, 200]

for product, price in zip(products, prices):
    print('Product: {}, Price: {}'.format(product, price))

**Note**: Take a look at the [itertools](https://docs.python.org/3/library/itertools.html) module, which has more tools and functions for efficient looping.
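
As one example from itertools, `zip_longest` pads the shorter input instead of truncating to it. The sketch below reuses the lists from the cell above and also shows `dict(zip(...))`, a common way to build a lookup from two lists.

```python
from itertools import zip_longest

products = ['table', 'chair', 'sofa', 'bed']
prices = [50, 20, 200]

# zip stops at the shorter list; zip_longest pads the shorter one with fillvalue
paired = list(zip_longest(products, prices, fillvalue=None))
print(paired)

# dict(zip(...)) builds a lookup table; 'bed' is dropped because it has no price
price_lookup = dict(zip(products, prices))
print(price_lookup)
```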

## 3. Pandas
*Agenda: Data loading, viewing, selection, grouping/aggregation, mutation, string operations*

#### 3.1 Load the data

In [None]:
import pandas as pd

# Loading a csv
gap_df = pd.read_csv('gapminder.csv')

# Other loading tricks
gap_df1 = pd.read_csv('gapminder.csv', nrows=10)  # Reads only the first 10 rows; useful for previewing a dataset too large to fit in memory
gap_df2 = pd.read_csv('gapminder.csv', usecols=['country', 'continent', 'population'])  # When you want to keep only a few columns

Variable Dictionary
- country: country name
- year: year (YYYY format)
- population: total population 
- continent: excludes Antarctica
- life_exp: Life Expectancy (yrs)
- gdp_cap: GDP per Capita

#### 3.2 Viewing the data

In [None]:
# Dimensions of the dataframe
gap_df.shape

In [None]:
# Display first 10 rows
gap_df.head(10)

In [None]:
# Display last 5 rows
gap_df.tail(5)

In [None]:
# Data type of each column
gap_df.dtypes

In [None]:
# Quick descriptive stats
gap_df.describe()

# Use the include/exclude arguments to keep/remove certain types of columns
# gap_df.describe(include="all")
# gap_df.describe(exclude="O")  # Removes columns of dtype "object"

#### 3.3 Missing values

In [None]:
# Use the .isna() method to check whether a column has missing values; chaining .sum() counts them per column

#gap_df.isna()

gap_df.isna().sum()
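
Once missing values have been counted, two common next steps are dropping or filling them. The sketch below uses a tiny made-up dataframe (not gapminder.csv) so that the effect is easy to see. Which option is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'life_exp': [70.0, np.nan, 65.0],
})

# Count missing values per column
missing_counts = toy_df.isna().sum()

# Option 1: drop the rows where life_exp is missing
dropped = toy_df.dropna(subset=['life_exp'])

# Option 2: fill missing life_exp values with the column mean
mean_life_exp = toy_df['life_exp'].mean()  # NaN is skipped by default
filled = toy_df.assign(life_exp=toy_df['life_exp'].fillna(mean_life_exp))

print(missing_counts)
print(dropped)
print(filled)
```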



#### 3.4 Selection

There are multiple ways to select data from a pandas dataframe. Here are a few options...

In [None]:
# Select columns using the double bracket notation
gap_df[['year', 'population']] 

In [None]:
# Select multiple rows by indexing in as though the dataframe were a list -- note that this notation ignores 
# the dataframe's index
gap_df[0:3]

In [None]:
# Select a row or rows using "loc", which selects by index label (slices include the end label)
gap_df.loc[0]
#gap_df.loc[0:2]
#gap_df.loc[0:2, ['year','country']]

In [None]:
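# Select a row or rows using "iloc", which selects by integer position and ignores the index labels
gap_df_adjusted_index = gap_df.copy()
gap_df_adjusted_index.index = gap_df_adjusted_index.index + 3  # Index labels now start at 3
gap_df_adjusted_index.iloc[0]  # Still returns the first row, even though its label is 3
#gap_df_adjusted_index.iloc[0:2]
#gap_df_adjusted_index.iloc[0:2, 3:5] # Note that iloc also uses numbers to denote the columns selected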
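
The difference between "loc" and "iloc" is easiest to see on a dataframe whose index does not start at 0. A small sketch with made-up data:

```python
import pandas as pd

# Toy data, made up for illustration, with index labels starting at 3
toy_df = pd.DataFrame(
    {'country': ['A', 'B', 'C', 'D'], 'year': [2000, 2001, 2002, 2003]},
    index=[3, 4, 5, 6],
)

# loc selects by index label, and the end of the slice is included
by_label = toy_df.loc[3:5]

# iloc selects by integer position, and the end of the slice is excluded
by_position = toy_df.iloc[0:2]

print(by_label)
print(by_position)
```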


In [None]:
# Filtering by column values
gap_df[(gap_df['life_exp'] < 70) & (gap_df['year'] == 2002)]

In [None]:
# Selecting unique values from a single column
gap_df['continent'].unique()

In [None]:
# Selecting unique values from a set of columns
gap_df[['country', 'continent']].drop_duplicates()

#### 3.5 Aggregation and Grouping

In [None]:
# Basic aggregations over the entire dataframe: You can just apply the functions directly.
gap_df.mean(numeric_only=True)
#gap_df.std(numeric_only=True)
#gap_df.min()
#gap_df.max()
#gap_df.median(numeric_only=True)
#gap_df.mode()

In [None]:
# Grouping: Creates a "pandas groupby" object
gap_df.groupby('continent')

In [None]:
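# Grouped aggregations: Apply aggregations to the "pandas groupby" object
gap_df.groupby('country').mean(numeric_only=True)
# Equivalent syntax with agg; the numeric columns are selected explicitly because
# recent pandas versions raise an error when asked for the mean of a text column
gap_df.groupby('country')[['year', 'population', 'life_exp', 'gdp_cap']].agg('mean')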

In [None]:
# Tip: Use as_index=False in groupby to keep the groups as a regular column
gap_df.groupby('country', as_index=False).mean(numeric_only = True)
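
To compute several different statistics in one pass, pandas supports "named aggregation" inside `agg`. The sketch below uses a small made-up dataframe. The output column names (`mean_life_exp` and so on) are arbitrary choices for this example.

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Europe', 'Europe'],
    'life_exp': [60.0, 70.0, 75.0, 79.0],
    'population': [100, 300, 50, 70],
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = toy_df.groupby('continent', as_index=False).agg(
    mean_life_exp=('life_exp', 'mean'),
    max_life_exp=('life_exp', 'max'),
    total_population=('population', 'sum'),
)
print(summary)
```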

#### 3.6 Mutations

In [None]:
# Basic functions of one or more columns: Use the intuitive syntax
gap_df['life_exp_increment'] = gap_df['life_exp'] + 1
gap_df['life_exp_times_population'] = gap_df['life_exp'] * gap_df['population']

In [None]:
# More complex functions: Use "apply"
gap_df['life_exp_string'] = gap_df['life_exp'].apply(lambda x: str(x) + ' yrs') # Apply with a single column

# Apply using multiple columns -- much slower, and don't forget the "axis=1"
gap_df['country_year'] = gap_df.apply(lambda row: row['country'] + ' - ' + str(row['year']), axis=1)

gap_df.head()
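
Because row-wise "apply" is slow, it is worth checking whether the same result can be written with whole-column operations. A sketch on made-up data, showing that the two approaches give the same strings here:

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({'country': ['A', 'B'], 'year': [2002, 2007]})

# Row-wise apply: flexible, but calls a Python function once per row
via_apply = toy_df.apply(
    lambda row: row['country'] + ' - ' + str(row['year']), axis=1
)

# Vectorized alternative: operate on whole columns at once
via_columns = toy_df['country'] + ' - ' + toy_df['year'].astype(str)

print(via_apply.tolist())
print(via_columns.tolist())
```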

In [None]:
# Map is occasionally an elegant alternative to apply
# A map is just a dictionary mapping an input value to an output value

continent_map = {'Asia': 1, 'Africa': 2, 'Europe': 3, 'Americas': 4, 'Oceania': 5}
gap_df['continent_val'] = gap_df['continent'].map(continent_map)
gap_df

In [None]:
# Frequency counts
gap_df['country'].value_counts().sort_index()

#### 3.7 String Operations

Full list of pandas functions for string handling [here](https://pandas.pydata.org/docs/reference/series.html#string-handling)

In [None]:
# Make upper case
gap_df['country'].str.upper()

In [None]:
# Make lower case
gap_df['country'].str.lower()
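
A few more methods from the `.str` accessor are useful for filtering and cleaning text columns. The sketch below runs on a made-up Series of country names rather than on gap_df.

```python
import pandas as pd

# Toy data, made up for illustration
countries = pd.Series(['United States', 'United Kingdom', 'Kenya'])

# Boolean mask: which names start with "United"?
starts_with_united = countries.str.startswith('United')

# Case-insensitive substring match
contains_king = countries.str.contains('king', case=False)

# Replace a literal substring in every value
shortened = countries.str.replace('United', 'U.', regex=False)

print(countries[starts_with_united].tolist())
print(contains_king.tolist())
print(shortened.tolist())
```

The boolean masks can be used for filtering in the same way as the column comparisons in section 3.4.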

## 4. Matplotlib


The matplotlib [gallery](https://matplotlib.org/2.0.2/gallery.html) has examples of various kinds of charts you can create (along with code snippets).

In [None]:
# Enable inline plotting of matplotlib figures, and import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

#### 4.1 Boxplots

In [None]:
plt.figure(figsize=(5, 5))
plt.boxplot(gap_df['life_exp'])
plt.show()

#### 4.2 Histograms

In [None]:
plt.figure()
plt.hist(gap_df['gdp_cap'], color='Red')
plt.show()

#### 4.3 Scatter Plots

In [None]:
plt.figure()
plt.scatter(gap_df['life_exp'], gap_df['gdp_cap'], alpha=0.2) # Low alpha makes overlapping markers readable
plt.show()

#### 4.4 Bar Plots

In [None]:
plt.figure()
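# Aggregate first: passing the raw columns would draw one overlapping bar per row
mean_gdp_cap = gap_df.groupby('continent')['gdp_cap'].mean()
plt.barh(mean_gdp_cap.index, mean_gdp_cap.values)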
plt.show()

#### 4.5 Making plots readable: Title, axes labeling, legends, subplots and more

In [None]:
# Title, axis labels, and legend
plt.figure()
plt.scatter(gap_df['life_exp'], gap_df['gdp_cap'], alpha=0.2, label='Observations (N=%i)' % len(gap_df)) # Label for legend

plt.title('GDP Per Capita v/s Life Expectancy', fontsize='x-large')
# Uncomment the lines below to add axis labels and a legend
# plt.xlabel('Life Expectancy (Years)')
# plt.ylabel('GDP Per Capita (USD)')
# plt.legend(loc='best')

# The right and top spines are ugly -- let's remove them
# plt.gca().spines['top'].set_visible(False)
# plt.gca().spines['right'].set_visible(False)

plt.show()

In [None]:
# Subplots
fig, ax = plt.subplots(2, 2, figsize=(10, 6))
ax = ax.flatten() # Turns the axes object into a 1D array instead of a 2D array -- convenient for indexing

ax[0].boxplot(gap_df['gdp_cap'].dropna())
ax[1].hist(gap_df['life_exp'], color='Red')
ax[2].scatter(gap_df['life_exp'], gap_df['gdp_cap'], alpha=0.2)
ax[3].barh(gap_df['continent'], gap_df['population'])

# Note that the syntax for the title is slightly different for subplots. Syntax is likewise a little different
# for setting axis labels.
ax[0].set_title('Distribution of GDP Per Capita')
ax[1].set_title('Distribution of Life Expectancy')
ax[2].set_title('GDP Per Capita v/s Life Expectancy')
ax[3].set_title('Population by Continent')

# Again, turning off the top and right spines
for a in ax:
    a.spines['top'].set_visible(False)
    a.spines['right'].set_visible(False)

plt.tight_layout() # Always use tight_layout to maximize space in the plot
plt.show()
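
The cell above sets titles on subplots but not axis labels. As a minimal sketch of the axes-level syntax, the example below uses a few made-up numbers instead of gap_df so that it runs on its own.

```python
import matplotlib.pyplot as plt

# Toy data, made up for illustration
life_exp = [55, 62, 70, 78]
gdp_cap = [900, 2500, 9000, 30000]

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# With subplots, plt.title / plt.xlabel / plt.ylabel become
# ax.set_title / ax.set_xlabel / ax.set_ylabel
ax[0].scatter(life_exp, gdp_cap)
ax[0].set_title('GDP Per Capita v/s Life Expectancy')
ax[0].set_xlabel('Life Expectancy (Years)')
ax[0].set_ylabel('GDP Per Capita (USD)')

ax[1].hist(life_exp)
ax[1].set_title('Distribution of Life Expectancy')
ax[1].set_xlabel('Life Expectancy (Years)')
ax[1].set_ylabel('Count')

plt.tight_layout()
plt.show()
```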

#### 4.6 Making plots beautiful: Seaborn

In [None]:
import seaborn as sns
sns.set(font_scale=1.5) # Convenient way to set the font scale for all parts of the plot at the same time

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].scatter(gap_df['life_exp'], gap_df['gdp_cap'], alpha=0.2)
ax[1].hist(gap_df['life_exp'], color='Red')

plt.show()

In [None]:
# Uncomment and run this line below to remove the grids and grey background for the plots
# sns.set_style("whitegrid", {'axes.grid' : False}) 

In [None]:
fig, ax = plt.subplots(figsize=(5, 5))

# Use the hue argument to shade the points in this scatterplot by a categorical variable
sns.scatterplot(data=gap_df, x='life_exp', y='gdp_cap', hue='continent', alpha=0.2)
plt.legend(bbox_to_anchor=(1.05, 1), fontsize=12)  # bbox_to_anchor allows you to manually move the legend
plt.xlabel("Life Expectancy", fontsize=13)  # Font size can be set separately for each piece of text
plt.ylabel("GDP Per Capita (USD)", fontsize=14)
sns.despine()  # Get rid of the top and right spines
plt.show()

**Bonus**: Seaborn also has beautiful built-in plots. If there is time, try experimenting with any of the following plots from seaborn using the gap_df data: [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html), [violinplot](https://seaborn.pydata.org/generated/seaborn.violinplot.html), or [kernel density estimate](https://seaborn.pydata.org/generated/seaborn.kdeplot.html). 

## 5. Bonus: Some Pandas Exercise Questions

#### (Adapted from An Introduction to Statistical Learning, James et al. (2013))


Using the 'gapminder.csv' dataset from earlier, try to answer the questions below.

a) Which variables are quantitative and which are qualitative?

Write your answer below - 

b) What is the *range* of **life_exp**?

d) What are the mean and standard deviation of **population** and **gdp_cap**?

e) Now remove observations from the continent "Oceania", and for the remaining data report the min, max, mean, and standard deviation of **life_exp**.

f) For each year in the dataset, identify the country with the maximum GDP per capita.