# Lab 1 - Python, Pandas and Numpy
- **Author:** Qutub Khan Vajihi ([qutubkhan.vajihi@berkeley.edu](mailto:qutubkhan.vajihi@berkeley.edu)) (Adapted from labs by Dimitris Papadimitriou)
- **Date:** 27 January 2021
- **Course:** INFO 251: Applied Machine Learning

### Learning Objectives:

* Know what constitutes good style when writing Python code
* Learn some useful Python features
* Work with DataFrames using the Pandas library
* Produce basic graphs using the Matplotlib library
* Understand numpy, matrix operations and iterations

## 1. Python Code Style
Guido van Rossum, the creator of Python, co-authored [PEP 8 -- Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/). The guide motivates itself as follows:

"One of Guido's key insights is that code is read much more often than it is written. The guidelines provided here are intended to improve the readability of code and make it consistent across the wide spectrum of Python code. As PEP 20 says, "Readability counts"."

Below are a few points from the guide that we find especially useful; refer to the guide itself for the full details.

* Line length:
    Maximum line length is 79 characters.

* Names: Make variable names (nouns) and function names (verbs) descriptive.

* Quotes: Pick either ' or " for strings and use it consistently.

* Numbers: 1e6 or 10 ** 6 is more readable than 1000000. Note that 1e6 is a float, whereas 10 ** 6 is an int.

* Whitespace: 
Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <=, >=, in, not in, is, is not), Booleans (and, or, not).

In [None]:
# Notice the single space on either side of each operator
i = 0
submitted = 0
i = i + 1
submitted += 1
print(submitted)

* Avoid extraneous whitespace immediately inside parentheses, brackets, or braces.

In [None]:
# Correct:
ham = [0, 1, 2]
print(ham[0])

# Wrong:
print(ham[ 0 ])

* Blank lines:
Surround top-level function and class definitions with two blank lines.

In [None]:
import math


def foo(x):
    if x >= 0:
        return math.sqrt(x)
    else:
        return None


def bar(x):
    if x < 0:
        return None
    return math.sqrt(x)

* Comments:
    For readability, explain what non-obvious lines do with short comments.

In [None]:
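# Map each label to an integer code
my_map = {'AML': 0, 'Lab': 1}
# Invert the mapping so that each code points back to its label
inv_map = {v: k for k, v in my_map.items()}
inv_map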

* Imports: Put each import on its own line.

In [None]:
# Correct:
import os
import sys

# Wrong:
import sys, os

# Correct: several names from the same module may share one line
from subprocess import Popen, PIPE

#### Rules of thumb
- **Code is for people to read.**
- **Use your best judgment.**


For further reference: https://towardsdatascience.com/an-overview-of-the-pep-8-style-guide-5672459c7682

## 2. Some Useful Python Features

* Use **with** to open files; it ensures the file is closed when the block ends, even if an error occurs.

In [None]:
with open('test.txt', 'r') as f:
    for line in f:
        print(line)

In [None]:
# Without 'with', you must open and close the file explicitly.
f = open('test.txt', 'r')
for line in f:
    print(line)

f.close()
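
As a minimal sketch of why **with** is the safer choice, the cell below simulates an error while reading and then checks that the file was still closed. The scratch file `demo.txt` and the temporary directory are assumptions made purely for illustration; they are not part of the lab data.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('first line\nsecond line\n')

try:
    with open(demo_path, 'r') as f:
        first_line = f.readline()
        # Simulate something going wrong while the file is open.
        raise ValueError('simulated failure while reading')
except ValueError:
    pass

# The context manager closed the file despite the exception.
print(f.closed)  # True
```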

* Concatenate path parts with **os.path.join**

In [None]:
import os
country_name = 'USA'
month = 'January'
path = os.path.join('a', 'b', country_name, month)
print(path)

In [None]:
# Manual concatenation hard-codes the '/' separator, so prefer os.path.join
path = 'a/b/' + country_name + '/' + month
print(path)

* **enumerate** is great for getting the index and the element of an iterable at the same time.
    - It yields each element of the iterable together with its index number.

In [None]:
for i, x in enumerate([1, 2, 3]):
    print('Index:', i)
    print('Element:', x)

In [None]:
# Otherwise, the standard approach is to maintain a counter by hand.
index = 0
for x in [1, 2, 3]:
    print('Index:', index)
    print('Element:', x)
    index += 1

* **lambda** - A lambda function is a small anonymous function, that is, a function defined without a name.

In [None]:
x = lambda a, b: a * b
print(x(5, 6))  # prints 30

x = lambda a: a * 3 + 3
print(x(3))  # prints 12
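
In practice, a lambda is most often passed directly as an argument rather than assigned to a name. The sketch below sorts a list of pairs by price using the `key` parameter of the built-in `sorted`; the `items` list is made-up data for illustration only.

```python
# Hypothetical (product, price) pairs, used only for illustration.
items = [('table', 50), ('chair', 20), ('sofa', 200)]

# Sort by the second element of each pair (the price).
by_price = sorted(items, key=lambda item: item[1])
print(by_price)  # [('chair', 20), ('table', 50), ('sofa', 200)]
```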

* **zip** - The zip() function returns a zip object, which is an iterator of tuples: the first items of the passed iterables are paired together, then the second items, and so on. If the iterables have different lengths, the shortest one determines the length of the result. (https://www.w3schools.com/python/ref_func_zip.asp)

In [None]:
products = ['table', 'chair', 'sofa', 'bed']
prices = [50, 20, 200, 150]

for product, price in zip(products, prices):
    print('Product: {}, Price: {}'.format(product, price))
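
The truncation behaviour is easy to miss, so here is a small sketch with lists of unequal length. The shortened `prices` list is an assumption for illustration; `itertools.zip_longest` from the standard library is shown as one way to keep the unmatched items instead.

```python
from itertools import zip_longest

products = ['table', 'chair', 'sofa', 'bed']
prices = [50, 20]  # deliberately shorter than products

# zip stops as soon as the shortest input is exhausted.
pairs = list(zip(products, prices))
print(pairs)  # [('table', 50), ('chair', 20)]

# zip_longest pads the shorter input with a fill value instead.
padded = list(zip_longest(products, prices, fillvalue=None))
print(padded)
```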

## 3. Working with Data!

### 3.1 Pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.




#### Load the data

In [None]:
import pandas as pd
# loading a csv
auto_df = pd.read_csv('Auto.csv')

The dataset contains the following columns:

- mpg: Miles per gallon
- cylinders: Number of cylinders between 4 and 8
- displacement: Engine displacement (cu. inches)
- horsepower: Engine horsepower
- weight: Vehicle weight (lbs.)
- acceleration: Time to accelerate from 0 to 60 mph (sec.)
- year: Model year (modulo 100)
- origin: Origin of car (1 = American, 2 = European, 3 = Japanese)
- name: Vehicle name

In [None]:
# dimensions of the dataframe
auto_df.shape

#### Viewing the data

In [None]:
# display first few rows
display(auto_df.head())

# display last few rows
auto_df.tail()

In [None]:
auto_df.index

In [None]:
auto_df.columns

In [None]:
auto_df.to_numpy()  # relatively expensive because of multiple dtypes

In [None]:
auto_df[['mpg', 'cylinders']].to_numpy()  # much faster

In [None]:
# convenient for quick descriptive stats! 
auto_df.describe()

In [None]:
auto_df.dtypes

In [None]:
auto_df.info()

#### Selection

There are several ways to select data; here are some common ones:

In [None]:
auto_df['mpg']

In [None]:
auto_df[0:3]

In [None]:
auto_df.loc[2, ['year','name']]

In [None]:
auto_df.loc[2, ['year','name']].to_frame()

In [None]:
auto_df.iloc[0:2,0:5]

In [None]:
auto_df[auto_df['mpg'] < 18.0]
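
The difference between **loc** (label-based) and **iloc** (position-based) is hidden when the index is the default 0, 1, 2, ... range. The sketch below uses a made-up frame, `toy_df`, with a non-default index to make the distinction visible; the values are assumptions for illustration and are not taken from Auto.csv.

```python
import pandas as pd

# Hypothetical toy frame with a non-default index, for illustration only.
toy_df = pd.DataFrame(
    {'mpg': [18.0, 15.0, 30.5], 'cylinders': [8, 8, 4]},
    index=[10, 20, 30],
)

by_label = toy_df.loc[20, 'mpg']    # row whose index label is 20
by_position = toy_df.iloc[1, 0]     # second row, first column
label_slice = toy_df.loc[10:20]     # label slices include the end label
position_slice = toy_df.iloc[0:1]   # position slices exclude the end

print(by_label, by_position)                   # 15.0 15.0
print(len(label_slice), len(position_slice))   # 2 1
```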

#### Grouping

In [None]:
auto_origin = auto_df.groupby('origin')

In [None]:
auto_origin

In [None]:
# numeric_only=True skips non-numeric columns such as 'name'
auto_origin.mean(numeric_only=True)

In [None]:
auto_origin.max()
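
To see what a group-by computes, it can help to try it on a frame small enough to check by hand. The sketch below uses made-up rows (`toy_df` is not the Auto data) and pandas' named aggregation through `agg`, rather than the separate `.mean()` and `.max()` calls above, to produce several summaries in one step.

```python
import pandas as pd

# Hypothetical rows, small enough to verify by hand.
toy_df = pd.DataFrame({
    'origin': [1, 1, 2, 3],
    'mpg': [18.0, 16.0, 26.0, 31.0],
    'weight': [3500, 3700, 2200, 2100],
})

# One output column per (input column, aggregation) pair.
summary = toy_df.groupby('origin').agg(
    mean_mpg=('mpg', 'mean'),
    max_weight=('weight', 'max'),
)
print(summary)
```

For origin 1, the mean mpg is (18.0 + 16.0) / 2 = 17.0 and the maximum weight is 3700.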

#### Pivot Tables

In [None]:
pd.pivot_table(
    auto_df,
    values='acceleration',
    index=['cylinders'],
    columns=['year'],
    aggfunc='max',
)

#### Other useful functions

*Apply* : 

In [None]:
auto_df[['mpg','cylinders']].apply(lambda x : x.max() - x.min())

In [None]:
auto_df[['mpg','cylinders']].apply(lambda x : x.max() + 98)

*Map* : 

In [None]:
# Values that are not keys of cust_map are mapped to NaN
cust_map = {8: 0, 4: 1}
auto_df['cylinders_new'] = auto_df['cylinders'].map(cust_map)
auto_df
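
One thing to watch for with **map** and a dictionary: any value that is not a key of the dictionary becomes NaN. A minimal sketch on a made-up Series (the `cylinders` values here are assumptions for illustration), including one way to fill the gaps:

```python
import pandas as pd

# Hypothetical cylinder counts; 6 is deliberately absent from the mapping.
cylinders = pd.Series([8, 4, 6, 4])
cust_map = {8: 0, 4: 1}

mapped = cylinders.map(cust_map)
print(mapped.isna().sum())  # 1

# Replace the unmapped entries with a sentinel and restore an integer dtype.
filled = mapped.fillna(-1).astype(int)
print(filled.tolist())  # [0, 1, -1, 1]
```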

*String operations* - 

In [None]:
auto_df['name'].str.upper()

In [None]:
auto_df['name'].str.split(' ')

### 3.2 Matplotlib

#### Enable inline printing of matplotlib plots

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt

#### Boxplots

In [None]:
plt.figure(figsize=(10, 7))
plt.boxplot(auto_df['mpg'])
plt.show()

#### Histograms

In [None]:
plt.figure()
plt.hist(auto_df['cylinders'], color='Red')
plt.show()

#### Scatter Plots

In [None]:
plt.figure()
plt.scatter(auto_df['mpg'], auto_df['weight'], alpha=0.2)
plt.show()

#### Bar Plots

In [None]:
plt.figure()
plt.barh(auto_df['year'], auto_df['weight'])
plt.show()

#### Scatter Matrix

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
plt.figure()
scatter_matrix(auto_df, alpha=0.2, figsize=(12, 12), diagonal='kde')
plt.show()

#### 3d Plots

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

def randrange(n, vmin, vmax):
    '''
    Helper function to make an array of random numbers having shape (n, )
    with each number distributed Uniform(vmin, vmax).
    '''
    return (vmax - vmin)*np.random.rand(n) + vmin

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

n = 100

# For each set of style and range settings, plot n random points in the box
# defined by x in [23, 32], y in [0, 100], z in [zlow, zhigh].
for c, m, zlow, zhigh in [('r', 'o', -50, -25), ('b', '^', -30, -5)]:
    xs = randrange(n, 23, 32)
    ys = randrange(n, 0, 100)
    zs = randrange(n, zlow, zhigh)
    ax.scatter(xs, ys, zs, c=c, marker=m)

ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')

plt.show()

In [None]:
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np
from matplotlib import cm

fig = plt.figure()
# fig.gca(projection='3d') no longer works in Matplotlib 3.6 and later
ax = fig.add_subplot(projection='3d')

# Make data.
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)

# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)

# Customize the z axis.
ax.set_zlim(-1.01, 1.01)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)

plt.show()


## 4. NumPy

NumPy is an open-source numerical Python library. It provides multi-dimensional array and matrix data structures, along with mathematical operations on arrays such as trigonometric, statistical, and algebraic routines.

**Why use it?**

NumPy arrays can be much faster than plain Python lists for numerical work (speedups of up to 50x are often quoted), because the elements share a single type and operations on them run in compiled code.
The array object in NumPy is called *ndarray*, and it comes with many supporting functions that make it easy to work with.




In [None]:
import numpy as np

In [None]:
python_list = [1, 1, 2, 2, 4, 5, 6, 5, 1]
numpy_list = np.array([2,3,1,0])
numpy_list_from_python_list = np.array(python_list)
display(numpy_list)
numpy_list_from_python_list

In [None]:
numpy_list.shape

In [None]:
numpy_list.ndim

In [None]:
another_numpy_list = np.arange(20).reshape(4,5)
another_numpy_list

In [None]:
display(another_numpy_list.shape)
another_numpy_list.ndim

**Vectorization**

In [None]:
li_a = np.array([1,2,3])
li_b = np.array([1,1,1])
for i in range(len(li_a)):
    print(li_a[i] * li_b[i])

In [None]:
# NumPy uses vectorization to optimize this. It essentially delegates the loop
# to pre-compiled, optimized C code under the hood
li_a * li_b

In [None]:
li_a + li_b

In [None]:
li_a / li_b
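
To get a feel for the speed difference, the sketch below times a Python-level loop against the equivalent vectorized expression. The array size, the helper names `loop_square` and `vectorized_square`, and the repeat count are arbitrary choices for illustration; the actual timings depend on your machine, so no expected numbers are given.

```python
import timeit

import numpy as np

values = np.arange(100_000, dtype=np.float64)


def loop_square(arr):
    # Python-level loop: one interpreter step per element.
    return [x * x for x in arr]


def vectorized_square(arr):
    # A single NumPy call; the loop runs in compiled code.
    return arr * arr


# Both approaches compute the same numbers.
same_result = np.allclose(loop_square(values), vectorized_square(values))

loop_time = timeit.timeit(lambda: loop_square(values), number=5)
vec_time = timeit.timeit(lambda: vectorized_square(values), number=5)
print(f'same result: {same_result}')
print(f'loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s')
```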

**Linear Algebra & Matrix Operations**

In [None]:
a = np.array([[1.0, 2.0], [3.0, 4.0]])
a

In [None]:
a.transpose()

In [None]:
np.linalg.inv(a)
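
A quick sanity check for an inverse is to multiply it back by the original matrix and compare with the identity. Because of floating-point rounding, the sketch below compares with `np.allclose` instead of `==`.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
a_inv = np.linalg.inv(a)

# a @ a_inv should equal the 2x2 identity up to rounding error.
is_identity = np.allclose(a @ a_inv, np.eye(2))
print(is_identity)  # True
```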

In [None]:
np.eye(3)

In [None]:
A = np.random.rand(5,5)
print(A)

In [None]:
np.linalg.norm(A, 'fro')  # Frobenius norm

In [None]:
B = np.random.rand(5,5)
B

In [None]:
print(np.matmul(A, B))  # conventional matrix multiplication
print(A @ B)  # conventional matrix multiplication
print()
print(np.multiply(A, B))  # element-wise
print(A * B)  # element-wise
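
With random matrices it is hard to tell the two kinds of product apart by eye. The sketch below repeats the comparison on two small hand-picked matrices (`A_small` and `B_small` are assumptions for illustration) so the results can be verified by hand.

```python
import numpy as np

A_small = np.array([[1, 2], [3, 4]])
B_small = np.array([[0, 1], [1, 0]])

# Matrix product: rows of A_small combined with columns of B_small.
matrix_product = A_small @ B_small
print(matrix_product)  # [[2 1]
                       #  [4 3]]

# Element-wise product: entries multiplied position by position.
elementwise_product = A_small * B_small
print(elementwise_product)  # [[0 2]
                            #  [3 0]]
```

Multiplying by `B_small` on the right swaps the columns of `A_small`, whereas the element-wise product simply zeroes out the diagonal.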


**Other Useful Operations/Functions**

In [None]:
np.linspace(2.0, 3.0, num=5)

In [None]:
np.linspace(2.0, 3.0, num=10, endpoint=False)

In [None]:
a = np.arange(10)
np.where(a < 5, a, 10*a)

In [None]:
np.full((2, 2), 10)

In [None]:
np.concatenate((np.linspace(1.0, 2.0, num=5),np.linspace(3.0, 4.0, num=5)))

For further reference: https://numpy.org/devdocs/user/quickstart.html

## Some Exercise Questions

#### (Adapted from Introduction to Statistical Learning, James et al. (2013))


Using the 'Auto.csv' dataset from earlier, try to answer the questions below:

a) Are there missing values? Show at least two ways to check for null values in a dataframe.

b) Which predictors are quantitative and which are qualitative?

Write your answer below:

c) What is the *range* of **mpg** and **cylinders**?

d) What is the mean and standard deviation of **weight** and **acceleration**?

e) Now remove the 10th through 85th observations, and for the remaining data report the min, max, mean, and standard deviation of **mpg**.

f) What is the maximum weight per year?

**Some NumPy**

a) Initialize two random numpy matrices of 2x2 shape and stack them horizontally.

b) Initialize a 3x3 numpy matrix filled with 1's, and another 3x3 numpy matrix with 1's on the diagonal. Add the two matrices and replace all 1's in the result with the value 4.