# Python, Pandas, and Seaborn Refresher 

In this module we will do a quick review of some of the syntax that you will be using throughout this workshop. Remember that if you feel unfamiliar with this syntax and need more practice with the concepts, there are additional optional notebooks in your project for each of the libraries that we cover here.

## 1. Python Refresher

In this section we will perform a few exercises to refresh our memory of how Python works.

### 1.1 Data Types

Everything in Python is an **object** and every object in Python has a **type**. Some of the basic types include:

- **`int`** (integer; a whole number with no decimal place)
  - `10`
  - `-3`
- **`float`** (float; a number that has a decimal place)
  - `7.41`
  - `-0.006`
- **`str`** (string; a sequence of characters enclosed in single quotes, double quotes, or triple quotes)
  - `'this is a string using single quotes'`
  - `"this is a string using double quotes"`
  - `'''this is a triple quoted string using single quotes'''`
  - `"""this is a triple quoted string using double quotes"""`
- **`bool`** (boolean; a binary value that is either true or false)
  - `True`
  - `False`
- **`NoneType`** (a special type representing the absence of a value)
  - `None`

In Python, a **variable** is a name you specify in your code that maps to a particular **object**, object **instance**, or value.

By defining variables, we can refer to things by names that make sense to us. Names for variables can only contain letters, underscores (`_`), or numbers (no spaces, dashes, or other characters). Variable names must start with a letter or underscore.

<hr>

### 1.2 Basic operators <a name="operators"></a>

In Python, there are different types of **operators** (special symbols) that operate on different values. Some of the basic operators include:

- arithmetic operators
  - **`+`** (addition)
  - **`-`** (subtraction)
  - **`*`** (multiplication)
  - **`/`** (division)
  - __`**`__ (exponent)
- assignment operators
  - **`=`** (assign a value)
  - **`+=`** (add and re-assign; increment)
  - **`-=`** (subtract and re-assign; decrement)
  - **`*=`** (multiply and re-assign)
- comparison operators (return either `True` or `False`)
  - **`==`** (equal to)
  - **`!=`** (not equal to)
  - **`<`** (less than)
  - **`<=`** (less than or equal to)
  - **`>`** (greater than)
  - **`>=`** (greater than or equal to)

When multiple operators are used in a single expression, **operator precedence** determines the order in which the parts of the expression are evaluated. Operators with higher precedence are evaluated first (like PEMDAS in math). Operators with the same precedence are evaluated from left to right, except `**`, which groups from right to left. The operators covered above are listed here from highest to lowest precedence:

- `()` parentheses, for grouping
- `**` exponent
- `*`, `/` multiplication and division
- `+`, `-` addition and subtraction
- `==`, `!=`, `<`, `<=`, `>`, `>=` comparisons

> See https://docs.python.org/3/reference/expressions.html#operator-precedence
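
The precedence rules above can be checked with a few small expressions. This is only a sketch with arbitrary numbers; each comment shows how Python groups the expression.

```python
# Multiplication binds tighter than addition
print(2 + 3 * 4)        # 14, evaluated as 2 + (3 * 4)

# Parentheses override the default order
print((2 + 3) * 4)      # 20

# Exponentiation binds tighter than a minus sign on its left
print(-3 ** 2)          # -9, evaluated as -(3 ** 2)

# Same precedence: evaluated left to right
print(100 / 10 * 2)     # 20.0, evaluated as (100 / 10) * 2

# Exception: ** groups right to left
print(2 ** 3 ** 2)      # 512, evaluated as 2 ** (3 ** 2)

# Comparisons are evaluated after the arithmetic on either side
print(1 + 1 == 2)       # True
```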

<hr>

### 1.3 Examples
Run the following cells to see how variables and basic operators work

In [None]:
# Assigning some numbers to different variables
num1 = 10
num2 = -3
num3 = 7.41

In [None]:
# Exponent
num2 ** num1

In [None]:
# Assign the value of an expression to a variable
num8 = num1 + num2 * num3
num8

In [None]:
# Assign some strings to different variables
simple_string1 = 'an example'
simple_string2 = "oranges "

### 1.4 Basic containers <a name="containers"></a>

> Note: **mutable** objects can be modified after creation and **immutable** objects cannot.

Containers are objects that can be used to group other objects together. The basic container types include:

- **`str`** (string: immutable; indexed by integers; items are stored in the order they were added)
- **`list`** (list: mutable; indexed by integers; items are stored in the order they were added)
  - `[3, 5, 6, 3, 'dog', 'cat', False]`
- **`tuple`** (tuple: immutable; indexed by integers; items are stored in the order they were added)
  - `(3, 5, 6, 3, 'dog', 'cat', False)`
- **`set`** (set: mutable; not indexed at all; items are NOT stored in the order they were added; can only contain immutable objects; does NOT contain duplicate objects)
  - `{3, 5, 6, 3, 'dog', 'cat', False}`
- **`dict`** (dictionary: mutable; key-value pairs are indexed by immutable keys; as of Python 3.7, items are stored in the order they were added)
  - `{'name': 'Jane', 'age': 23, 'fav_foods': ['pizza', 'fruit', 'fish']}`

When defining lists, tuples, or sets, use commas (,) to separate the individual items. When defining dicts, use a colon (:) to separate keys from values and commas (,) to separate the key-value pairs.

Strings, lists, and tuples are all **sequence types** that can use the `+`, `*`, `+=`, and `*=` operators.
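
To make the mutable/immutable distinction and the sequence operators more concrete, here is a small sketch. The names `pets` and `point` are hypothetical and used only for this illustration.

```python
# Lists are mutable: an item can be replaced in place
pets = ['dog', 'cat', 'fish']
pets[0] = 'hamster'
print(pets)                      # ['hamster', 'cat', 'fish']

# Tuples are immutable: item assignment raises a TypeError
point = (3, 5)
try:
    point[0] = 10
except TypeError as err:
    print("TypeError:", err)

# Sets silently drop duplicate items
print(len({3, 5, 6, 3}))         # 3

# Sequence operators: + concatenates, * repeats
print('ab' + 'cd')               # abcd
print((1, 2) * 2)                # (1, 2, 1, 2)
```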

#### Examples
Run the following cells to see how basic containers work

In [None]:
# Assign some containers to different variables
list1 = [3, 5, 6, 3, 'dog', 'cat', False]
tuple1 = (3, 5, 6, 3, 'dog', 'cat', False)
set1 = {3, 5, 6, 3, 'dog', 'cat', False}
dict1 = {'name': 'Jane', 'age': 23, 'fav_foods': ['pizza', 'fruit', 'fish']}

In [None]:
# Multiply
[1, 2, 3, 4] * 2

In [None]:
# Add and re-assign
list1 += [5, 'grapes']
list1

### 1.5 Accessing data in containers <a name="data"></a>

For strings, lists, tuples, and dicts, we can use **subscript notation** (square brackets) to access data at an index.

- strings, lists, and tuples are indexed by integers, **starting at 0** for first item
  - these sequence types also support accessing a range of items, known as **slicing**
  - use **negative indexing** to start at the back of the sequence
- dicts are indexed by their keys

> Note: sets are not indexed, so we cannot use subscript notation to access data elements.
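
The indexing rules above can be sketched as follows. The list `letters` and the dict `person` are example data chosen for this illustration.

```python
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[0])      # a  (indexing starts at 0)
print(letters[-1])     # e  (negative indexes count from the end)
print(letters[1:4])    # ['b', 'c', 'd']  (start included, stop excluded)
print(letters[:2])     # ['a', 'b']
print(letters[-2:])    # ['d', 'e']

# Dicts are indexed by key; nested containers can be indexed in a chain
person = {'name': 'Jane', 'fav_foods': ['pizza', 'fruit', 'fish']}
print(person['fav_foods'][1])   # fruit
```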

#### Examples
Run the following cells to see how accessing data in containers works

In [None]:
# Access the first item in a sequence
list1[0]

In [None]:
# Access a range of items in a sequence
simple_string1[3:8]

In [None]:
# Access an item in a dictionary
dict1['name']

#### Exercises
Attempt to solve the problems in the comments below. You can load the [Answers](#answers-5.2) when you are ready to check your work

In [None]:
# 5.2.1 Create some containers
list2 = ['car', 'plane', 'boat', 'train']
tuple2 = ('first', 'second', 'third', 'fourth')
dict2 = {'CPU':'Arm', 'Network':'WiFi', 'Storage':['SSD', 'Sata', 'Tape']}

In [None]:
# 5.2.2 What is the first item in list2?


In [None]:
# 5.2.3 What is the last item in list2?


In [None]:
# 5.2.4 What are the first 3 items in tuple2?


In [None]:
# 5.2.5 What is the value of dict2 for the CPU?


In [None]:
# 5.2.6 What is the first type of Storage in dict2?


#### Answers
Uncomment and run the cell below to get the answers to the exercises above

In [None]:
# %load https://raw.githubusercontent.com/IBM/python-and-analytics/master/data/answers/python3-answers-5.2.py

### 1.6 Python built-in functions and callables <a name="builtin"></a>

A **function** is a Python object that you can "call" to **perform an action** or compute and **return another object**. You call a function by placing parentheses to the right of the function name. Some functions allow you to pass **arguments** inside the parentheses (separating multiple arguments with a comma). Internal to the function, these arguments are treated like variables.

Python has several useful built-in functions to help you work with different objects and/or your environment. Here is a small sample of them:

- **`type(obj)`** to determine the type of an object
- **`len(container)`** to determine how many items are in a container
- **`callable(obj)`** to determine if an object is callable
- **`sorted(container)`** to return a new list from a container, with the items sorted
- **`sum(container)`** to compute the sum of a container of numbers
- **`min(container)`** to determine the smallest item in a container
- **`max(container)`** to determine the largest item in a container
- **`abs(number)`** to determine the absolute value of a number
- **`repr(obj)`** to return a string representation of an object

> Complete list of built-in functions: https://docs.python.org/3/library/functions.html


There are also different ways of defining your own functions and callable objects that we will explore later.
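
A few of the built-in functions listed above, applied to small made-up containers (`prices` and `counts` are hypothetical names for this sketch):

```python
prices = [7.41, -3, 10, -0.006]
counts = [10, -3, 5]

print(sorted(prices))             # [-3, -0.006, 7.41, 10]
print(prices)                     # original list is unchanged by sorted()
print(min(prices), max(prices))   # -3 10
print(len(prices))                # 4
print(sum(counts))                # 12
print(abs(-3))                    # 3
print(callable(len))              # True: functions can be called
print(callable(prices))           # False: a list cannot be called
print(repr('10'))                 # '10' (the quotes show it is a string)
```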

#### Examples
Run the following cells to see how Python built-in functions and callables work

In [None]:
# Use the type() function to determine the type of an object
type(simple_string1)


In [None]:
# Use the len() function to determine how many items are in a container
len(dict1)

### 1.7 Python object attributes (methods and properties) <a name="attributes"></a>

Different types of objects in Python have different **attributes** that can be referred to by name (similar to a variable). To access an attribute of an object, use a dot (`.`) after the object, then specify the attribute (i.e. `obj.attribute`).

When an attribute of an object is a callable, that attribute is called a **method**. It is the same as a function, only this function is bound to a particular object.

When an attribute of an object is not a callable, that attribute is called a **property**. It is just a piece of data about the object, that is itself another object.

The built-in `dir()` function can be used to return a list of an object's attributes.
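
The difference between a callable attribute (method) and a non-callable one can be checked with `callable()`. The sketch below uses a string and a complex number purely as convenient built-in examples.

```python
text = 'an example'

# A method is a callable attribute bound to the object
print(text.upper())              # AN EXAMPLE
print(callable(text.upper))      # True

# Complex numbers expose non-callable attributes .real and .imag
z = 3 + 4j
print(z.real, z.imag)            # 3.0 4.0
print(callable(z.real))          # False

# dir() returns the attribute names as a list of strings
print('upper' in dir(text))      # True
```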

<hr>

#### Some methods on list objects <a name="lists"></a>

- **`.append(item)`** to add a single item to the list
- **`.extend([item1, item2, ...])`** to add multiple items to the list
- **`.remove(item)`** to remove a single item from the list
- **`.pop()`** to remove and return the item at the end of the list
- **`.pop(index)`** to remove and return an item at an index
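
Here is a short sketch of these list methods on a separate, made-up list (`fruits`), so it does not give away the exercises that follow. Note that `.append()`, `.extend()`, and `.remove()` change the list in place and return `None`, while `.pop()` returns the removed item.

```python
fruits = ['apple', 'banana']

fruits.append('cherry')                 # ['apple', 'banana', 'cherry']
fruits.extend(['date', 'elderberry'])   # adds each item of the given list
fruits.remove('banana')                 # removes the first matching item
last = fruits.pop()                     # removes and returns 'elderberry'
first = fruits.pop(0)                   # removes and returns 'apple'

print(last, first)                      # elderberry apple
print(fruits)                           # ['cherry', 'date']
```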

#### Remember our list:
`list1 = [3, 5, 6, 3, 'dog', 'cat', False]` (followed by `5` and `'grapes'` if you ran the add-and-re-assign cell in section 1.4)

#### Exercises
Attempt to solve the problems in the comments below. You can load the [Answers](#answers-7.2) when you are ready to check your work

In [None]:
# 7.2.1 Add a 'cow' to the list


In [None]:
# 7.2.2 Add 'chicken' and 'pig'


In [None]:
# 7.2.3 Remove the 'dog'


In [None]:
# 7.2.4 Remove and return the last item in the list


#### Answers
Uncomment and run the cell below to get the answers to the exercises above

In [None]:
# %load https://raw.githubusercontent.com/IBM/python-and-analytics/master/data/answers/python3-answers-7.2.py


## 2. Pandas

In this section we will take a quick look at common operations you can do in Pandas. Pandas is an extremely capable library, and combined with Python it can handle almost any data manipulation task you can think of. In this section, however, we are only refreshing your memory of some of the common tasks that we will use in the rest of the course.

Keep in mind that the pandas DataFrame has more than 200 attributes and methods, and there are often many ways of achieving the same result in Pandas. If you want to learn more, a quick look at the documentation should get you started.

### 2.1 What is Pandas

[pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. The name comes from combining the words "Panel" and "Data". The library natively supports CSV datasets as well as SQL databases and allows you to explore and manipulate data.

### 2.2 Loading the Data



In [None]:
import pandas as pd
import numpy as np

There are several ways to load data in pandas. The simplest way is by using the `read_csv()` function. Let's load our data and show the first few rows using the `.head()` method.

In [None]:
df_pd = pd.read_csv("https://raw.githubusercontent.com/IBM/ml-learning-path-assets/master/data/predict_home_value.csv")
df_pd.head()
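
`read_csv()` also accepts file-like objects, which is handy for trying it out without a network connection. The sketch below parses a tiny in-memory CSV; the column names mirror the housing data, but the three rows are made-up values used only for illustration.

```python
import io

import pandas as pd

# A tiny CSV held in a string (hypothetical values, not the real dataset)
csv_text = "LOTAREA,SALEPRICE\n8450,208500\n9600,181500\n11250,223500\n"

# io.StringIO wraps the string so read_csv can treat it like a file
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))   # first two rows
print(toy.shape)     # (3, 2): three rows, two columns
```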

In [None]:
df_pd.columns

Let's select a few columns to refresh our memory of the syntax and to simplify the next sections.

In [None]:
df_subset = df_pd[['LOTAREA', 'SALEPRICE', 'FIREPLACEQU','YEARBUILT', 'FULLBATH', 'FOUNDATION']]
df_subset

Next, we'll look at how we can view the type of data in each column. This is especially important because the operations we perform on numerical columns are different from the ones we perform on categorical columns.

In [None]:
df_subset.dtypes
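
One way to act on the column types is to split the columns into numeric and non-numeric groups with `select_dtypes()`. This is a minimal sketch on a toy DataFrame with made-up values, not the workshop dataset.

```python
import pandas as pd

# Toy data (hypothetical values): one numeric and one text column
toy = pd.DataFrame({
    'LOTAREA': [8450, 9600, 11250],
    'FOUNDATION': ['PConc', 'CBlock', 'PConc'],
})
print(toy.dtypes)

# Split the column names by type
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)       # ['LOTAREA']
print(categorical_cols)   # ['FOUNDATION']
```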

### 2.3 Cleaning the Data

Next, let's take a look at some of the common operations to inspect and clean the data.

You can use `.info()` to get some high-level information about the columns. Basic statistical methods such as `.mean()`, `.corr()`, `.max()`, and `.min()` are also available.

In [None]:
df_subset.info()

In [None]:
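# numeric_only=True (pandas 1.5+) skips text columns such as FOUNDATION,
# which would otherwise raise an error in pandas 2.0 and later
df_subset.corr(numeric_only=True)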

Another important item to look at when cleaning the data is missing values. Most ML algorithms don't handle missing values well, so we should either remove them or fill them in beforehand.

Let's first find our missing values and then fill them in.

In [None]:
# is the cell value NA?
df_subset.isna()

Now that we know whether each cell is NA, we can count them to find out how much of each column is missing. Since `True` has a value of 1 and `False` has a value of 0, we can simply sum each column to find out how many missing values we have.

In [None]:
df_subset.isna().sum()
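
The counting trick relies on booleans behaving like the integers 1 and 0. In plain Python, with a made-up list of flags, the same idea looks like this:

```python
# Hypothetical flags: True marks a missing value
flags = [True, False, True, True]

print(True + True)               # 2: True counts as 1
print(sum(flags))                # 3: number of True values
print(sum(flags) / len(flags))   # 0.75: fraction of True values
```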

Next, let's fill in the blanks with the mode of the column and verify that we have no missing values left.

In [None]:
# Get Mode of each column and then take the first row with iloc
# If you're curious what .mode() returns, you can add a cell and 
# run df_subset.mode() 

column_mode = df_subset.mode().iloc[0]

df_subset = df_subset.fillna(column_mode)

In [None]:
df_subset.isna().sum()

Great. No more missing data.
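
To see each step of the mode-based fill in isolation, here is a sketch on a toy DataFrame. The column names are borrowed from the housing data, but the values are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    'FULLBATH': [2, 2, np.nan, 1],
    'FOUNDATION': ['PConc', None, 'PConc', 'CBlock'],
})
print(toy.isna().sum())          # one missing value in each column

# .mode() returns a DataFrame; its first row holds the most frequent
# value of each column
column_mode = toy.mode().iloc[0]
print(column_mode)

# fillna with a Series fills each column with its matching entry
filled = toy.fillna(column_mode)
print(filled.isna().sum())       # no missing values remain
```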

There are many more operations that you can perform on the columns and rows of the dataframes. Take a look at the optional notebook that's in this workshop to learn more. 

## 3. Data Visualization

In order to better understand the data, we can use visualizations such as charts, plots, and graphs. We'll use some common tools such as [`matplotlib`](https://matplotlib.org/users/index.html) and [`seaborn`](https://seaborn.pydata.org/index.html) and gather some statistical insights into our data.

We'll continue to use the data that we loaded above which looks into the housing prices.

### Seaborn

Seaborn is a Python data visualization library based on matplotlib. It is an easy-to-use visualization package that works well with Pandas DataFrames.

Below are a few examples using Seaborn. 

Refer to this [documentation](https://seaborn.pydata.org/index.html) for information on the many other plots you can create.

In [None]:
import seaborn as sns

### 3.1 Histograms
Let's start by taking a look at the histogram of the sale price.

In [None]:
sns.histplot(df_pd['SALEPRICE'])

After looking at the documentation for `histplot`, let's also add a kernel density estimate (KDE) curve to smooth out the distribution.

In [None]:
sns.histplot(df_pd['SALEPRICE'], kde=True)

#### Exercise 

Try plotting the histogram for the `LOTAREA`.

In [None]:
# YOUR ANSWER

### 3.2 Count Plots

Next, let's visualize how many times a value repeats in a column. For this, we will look at the `FULLBATH` column.

In [None]:
sns.countplot(x = df_pd['FULLBATH'])

#### Exercise

Do the same for the `HALFBATH` column.

In [None]:
# YOUR ANSWER

### 3.3 Joint Plot

Next, let's take a look at a joint plot. This draws a plot of two variables with bivariate and univariate graphs.

In [None]:
sns.jointplot(x = 'SALEPRICE', y = 'LOTAREA', data=df_pd, kind='reg')

#### Exercise

Try the same plot above with the following different `kind`s:
 - `scatter`
 - `kde`
 - `hist`
 - `hex`
 - `reg`
 - `resid`

In [None]:
# YOUR ANSWER

### 3.4 Pairplot

Next we will look at pair plots. If we were to create a pair plot for the full dataset, with all 32 columns, the image would be very hard to read. Therefore, we will pick 6 columns first and then create a pair plot.

Pair plots show pairwise relationships in a dataset. Only the numeric columns are plotted by default.

In [None]:
df_subset = df_pd[['LOTAREA', 'SALEPRICE', 'FIREPLACEQU','YEARBUILT', 'FULLBATH', 'FOUNDATION']]

sns.pairplot(df_subset)

#### Exercise

Select a different subset of the columns that you find interesting and then explore their relationships using a pairplot.

In [None]:
# YOUR ANSWER

<hr>

## Conclusion

At this point we have completed our quick review of the basics of Python, Pandas, and Seaborn for Machine Learning, and you are ready to move on to the next sections. Remember that if you are looking for more practice with what we covered here, the optional notebooks included in your project are the best place to start.
