# Getting started with Python, NumPy, and Pandas

Before you dive into data science projects and different machine learning algorithms, it is important to get accustomed to the tools that we will be using: the programming language Python, and two of its most popular packages in data science, NumPy and Pandas.

This guide walks you through basic Python operators and syntax, and common function calls from NumPy and Pandas. For more extensive guides and tutorials, I recommend the links below:
- [NumPy Beginner's Guide ebook](https://github.com/shihyu/python_ebook/tree/master/NumPy)
- [NumPy Quickstart Tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
- [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)
- [Pandas DataFrame tutorial on DataCamp](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python#gs.DYueGfM)

## Python

### Native datatypes
- **Numbers**: Python supports integers, floats, and complex numbers, although we will mostly use the first two.

In [1]:
a = 3
print('a:', a)

b = 2
print('b:', b)

c = a + b
print('c:', c)

print('Multiplication for a and b:', a * b)

print('Normal (float) division for a and b:', a / b)

print('Floor (integer) division for a and b:', a // b)

print('a to the power of b:', a ** b)

a: 3
b: 2
c: 5
Multiplication for a and b: 6
Normal (float) division for a and b: 1.5
Floor (integer) division for a and b: 1
a to the power of b: 9
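The difference between the two division operators above shows up in the type of the result. The following is a small sketch (the variable names are only illustrative) that uses the built-in `type()` function to inspect what each operation returns, and briefly shows a complex number:

```python
a = 3
b = 2

quotient = a / b         # true division always produces a float
floor_quotient = a // b  # floor division of two ints produces an int

print(type(a))               # <class 'int'>
print(type(quotient))        # <class 'float'>
print(type(floor_quotient))  # <class 'int'>

# Mixing an int with a float promotes the result to a float.
mixed = a + 0.5
print(mixed, type(mixed))    # 3.5 <class 'float'>

# Complex numbers use the j suffix for the imaginary part.
z = 1 + 2j
print(z.real, z.imag)        # 1.0 2.0
```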


- **String**: a string in Python is a sequence of characters, which could be one, more than one, or no characters at all.

In [2]:
a = 'I am a string'
print('a:', a)

b = 'I am a different string'
print('b:', b)

c = a + b
print('String concatenation:', c)

print('Looping through characters in a string:')
for i in a:
    print(i)

d = c[1:4]
print('Making sub-strings by slicing a string from index 1 to index 4:', d)

a: I am a string
b: I am a different string
String concatenation: I am a stringI am a different string
Looping through characters in a string:
I
 
a
m
 
a
 
s
t
r
i
n
g
Making sub-strings by slicing a string from index 1 to index 4:  am
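Note that the end of a slice is exclusive: slicing from index 1 to index 4 returns the characters at indices 1, 2, and 3. A short sketch of this, along with open-ended slices and negative indices (the variable `s` is just an example):

```python
s = 'I am a string'

# The stop index is excluded, so s[1:4] has 4 - 1 = 3 characters.
print(repr(s[1:4]))   # ' am'
print(len(s[1:4]))    # 3

# Omitting a bound slices from the start or through the end.
print(s[:4])          # I am
print(s[7:])          # string

# Negative indices count from the end of the string.
print(s[-1])          # g
print(s[-6:])         # string
```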


- **List**: an ordered, changeable sequence of values, which do not all have to be of the same type.

In [3]:
a = ['a', 'b', 'c']
print('A whole list could be printed out all at once:', a)

b = [1, 2, 3]
print('Looping through elements in a list:')
for i in b:
    print(i)

print('Accessing an element at a specific index:', a[0], b[1])

c = a + b
print('List concatenation', c)

d = c[1:3]
print('Slicing a list from index 1 to index 3:', d)

A whole list could be printed out all at once: ['a', 'b', 'c']
Looping through elements in a list:
1
2
3
Accessing an element at a specific index: a 2
List concatenation ['a', 'b', 'c', 1, 2, 3]
Slicing a list from index 1 to index 3: ['b', 'c']
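Unlike strings, lists can be modified after they are created. A minimal sketch of adding and replacing elements (the list contents are only an example):

```python
items = ['a', 'b', 'c']

items.append('d')   # add one element at the end
items[0] = 'z'      # replace the element at index 0
print(items)        # ['z', 'b', 'c', 'd']
print(len(items))   # 4

# Strings, by contrast, cannot be changed in place.
text = 'abc'
try:
    text[0] = 'z'
except TypeError as err:
    print('Strings cannot be modified in place:', err)
```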


- **Dictionary**: a collection of key-value pairs, similar to a Map or HashMap in Java. Each item in a dictionary has a key and a value. Since Python 3.7, dictionaries keep their items in insertion order.

In [4]:
a = {'a': 1, 'b': 2}
print('A whole dictionary could be printed out all at once:', a)

a['c'] = 3
print('Assign a value to a new key to add items to a dictionary:', a)

print('Looping through keys in a dictionary and getting the value at each key:')
for key in a:
    print('key:', key)
    print('value:', a[key])

A whole dictionary could be printed out all at once: {'a': 1, 'b': 2}
Assign a value to a new key to add items to a dictionary: {'a': 1, 'b': 2, 'c': 3}
Looping through keys in a dictionary and getting the value at each key:
key: a
value: 1
key: b
value: 2
key: c
value: 3
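A few other dictionary operations tend to come up often. The sketch below, using a made-up dictionary called `counts`, loops over keys and values together with `items()`, checks for a key with `in`, and reads a possibly missing key with `get()`:

```python
counts = {'a': 1, 'b': 2, 'c': 3}  # example data

# items() yields (key, value) pairs, avoiding a second lookup per key.
pairs = []
for key, value in counts.items():
    pairs.append((key, value))
print(pairs)                  # [('a', 1), ('b', 2), ('c', 3)]

# 'in' tests whether a key is present.
print('a' in counts)          # True
print('d' in counts)          # False

# get() returns a default instead of raising KeyError for a missing key.
print(counts.get('d', 0))     # 0
```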


### Flow control in Python

The `if...elif...else` statement is used for decision making.

In [5]:
a = 1
b = 2
c = 4

if a == 1:
    print('a is equal to 1.')
else:
    print('a is not equal to 1.')

if a == 2:
    print('a is equal to 2.')
elif c == b * 2:
    print('c is twice b')
else:
    print('No condition is true.')

a is equal to 1.
c is twice b


The `for` loop is typically used, as shown in the section above, to loop through the elements of a sequence or collection (strings, lists, dictionaries, etc.):

In [6]:
a = [1, 'a', 'A String']

for item in a:
    print(item)

1
a
A String
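When you need a counter rather than the elements themselves, or both at once, the built-in `range()` and `enumerate()` functions are the usual choices. A brief sketch with an example list:

```python
# range(3) produces the integers 0, 1, 2.
numbers = []
for i in range(3):
    numbers.append(i)
print(numbers)   # [0, 1, 2]

# enumerate() pairs each element with its index.
letters = ['a', 'b', 'c']
indexed = []
for index, letter in enumerate(letters):
    indexed.append((index, letter))
print(indexed)   # [(0, 'a'), (1, 'b'), (2, 'c')]
```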


### Functions in Python

In Python, a function is a group of related statements that performs a specific task. Functions help break our program into smaller chunks, making it more organized and manageable as it grows larger.

In [7]:
def sum_of(a, b):
    return a + b

x = 1
y = 2
z = sum_of(x, y)

print('The returned value could be directly printed out:', sum_of(x, y))
print('Or it could be stored in a variable:', z)

The returned value could be directly printed out: 3
Or it could be stored in a variable: 3
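Functions can also give parameters default values, so that callers only pass them when needed. The function `scaled_sum` below is a hypothetical example written for this illustration:

```python
def scaled_sum(a, b, scale=1):
    """Return (a + b) multiplied by scale; scale defaults to 1."""
    return (a + b) * scale

print(scaled_sum(1, 2))            # 3
print(scaled_sum(1, 2, scale=10))  # 30
```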


## NumPy

NumPy is one of the most popular Python packages, used for efficient computation on multidimensional arrays whose elements all share the same type. To import the package in your program, run:

In [8]:
import numpy as np

The most common way to initialize a NumPy array is to convert a Python list:

In [9]:
my_list = [1, 2, 3]
print(my_list)

my_array = np.array(my_list)
print(my_array)

[1, 2, 3]
[1 2 3]


It is also possible to call different NumPy built-in functions to get specific types of array:

In [10]:
array1 = np.arange(6)
print('NumPy array from 0 up to 6:', array1)

array2 = np.zeros((3,))
print('NumPy array of zeros with the specified shape:', array2)

array3 = np.ones((5,))
print('NumPy array of ones with the specified shape:', array3)

NumPy array from 0 up to 6: [0 1 2 3 4 5]
NumPy array of zeros with the specified shape: [ 0.  0.  0.]
NumPy array of ones with the specified shape: [ 1.  1.  1.  1.  1.]
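The shape argument can have more than one dimension, and these functions accept further arguments. The sketch below shows a two-dimensional array of zeros, an integer array of ones via the `dtype` argument, and `np.arange()` with a start, stop, and step:

```python
import numpy as np

# A shape of (2, 3) means 2 rows and 3 columns.
grid = np.zeros((2, 3))
print(grid.shape)    # (2, 3)
print(grid.dtype)    # float64

# zeros() and ones() produce floats unless a dtype is given.
int_ones = np.ones((4,), dtype=int)
print(int_ones)      # [1 1 1 1]

# arange(start, stop, step) excludes the stop value.
evens = np.arange(0, 10, 2)
print(evens)         # [0 2 4 6 8]
```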


**Basic Operations:** Arithmetic operations on NumPy arrays are applied _element-wise_:

In [11]:
my_array = np.array([1, 3, 5, 2])

print('Original array:', my_array)
print('Elements in array multiplied by 3:', my_array * 3)
print('Elements in array squared:', my_array ** 2)

new_array = np.array([1, 2, 3, 4])
print('Another array:', new_array)
print('Adding two arrays:', my_array + new_array)
print('Dividing one array by another:', my_array / new_array)

Original array: [1 3 5 2]
Elements in array multiplied by 3: [ 3  9 15  6]
Elements in array squared: [ 1  9 25  4]
Another array: [1 2 3 4]
Adding two arrays: [2 5 8 6]
Dividing one array by another: [ 1.          1.5         1.66666667  0.5       ]
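This element-wise behaviour is different from how Python lists treat the same operators. A short comparison, using the same values as the example above:

```python
import numpy as np

my_list = [1, 3, 5, 2]
my_array = np.array(my_list)

# On a list, * repeats the list and + concatenates.
print(my_list * 2)          # [1, 3, 5, 2, 1, 3, 5, 2]
print(my_list + my_list)    # [1, 3, 5, 2, 1, 3, 5, 2]

# On an array, * and + operate on each element.
print(my_array * 2)         # [ 2  6 10  4]
print(my_array + my_array)  # [ 2  6 10  4]
```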


**Concept of Shape:** Sometimes data is stored in tables, and the best way to keep the dimensionality of the data is to store it in two-dimensional NumPy arrays:

In [12]:
my_array = np.array([[1, 3, 5], [2, 4, 6]])
print(my_array)
print('Number of rows and number of columns:', my_array.shape)

[[1 3 5]
 [2 4 6]]
Number of rows and number of columns: (2, 3)


NumPy also allows you to change the shape of an array, as long as the total number of elements stays the same. Do this with the `reshape()` method, which returns the reshaped array:

In [13]:
my_array = my_array.reshape((3,2))
print(my_array)
print('Number of rows and number of columns:', my_array.shape)

[[1 3]
 [5 2]
 [4 6]]
Number of rows and number of columns: (3, 2)
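Two details of `reshape()` are worth seeing in code: the array it is called on keeps its own shape, and the new shape must hold exactly the same number of elements. The sketch below also uses `-1`, which asks NumPy to work out that dimension itself:

```python
import numpy as np

original = np.array([[1, 3, 5], [2, 4, 6]])
reshaped = original.reshape((3, 2))

print(original.shape)   # (2, 3), the original is unchanged
print(reshaped.shape)   # (3, 2)

# -1 lets NumPy infer the size; here it flattens to one dimension.
flat = original.reshape(-1)
print(flat)             # [1 3 5 2 4 6]

# 6 elements cannot fill a 4 x 2 array, so this raises ValueError.
try:
    original.reshape((4, 2))
except ValueError as err:
    print('Cannot reshape:', err)
```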


**Indexing and Slicing:**
- One-dimensional arrays can be indexed and sliced just like Python lists:

In [14]:
my_array = np.array([1, 2, 3, 4, 5])
print('Element at index 1:', my_array[1])
print('Element at index 3:', my_array[3])

sub_array = my_array[1:4]
print('Sub array from index 1 to index 4:', sub_array)

Element at index 1: 2
Element at index 3: 4
Sub array from index 1 to index 4: [2 3 4]


- Two-dimensional arrays need to be indexed and sliced with a pair of indices or slices, separated by a comma (rows first, then columns):

In [15]:
my_array = np.array([[1, 2, 3, 4, 5], [2, 3, 6, 7, 3], [-1, 3, -4, 5, 3]])
print('Original array:')
print(my_array)

print('Row at index 1 (second row):', my_array[1,:])
print('Column at index 0 (first column):', my_array[:,0])

print('Elements between rows with index 0 and index 2, columns with index 1 and index 3:')
print(my_array[0:2, 1:3])

Original array:
[[ 1  2  3  4  5]
 [ 2  3  6  7  3]
 [-1  3 -4  5  3]]
Row at index 1 (second row): [2 3 6 7 3]
Column at index 0 (first column): [ 1  2 -1]
Elements between rows with index 0 and index 2, columns with index 1 and index 3:
[[2 3]
 [3 6]]
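The same row-and-column notation also picks out a single element, and negative indices count from the end as they do for lists. A small sketch using the array from the example above:

```python
import numpy as np

my_array = np.array([[1, 2, 3, 4, 5], [2, 3, 6, 7, 3], [-1, 3, -4, 5, 3]])

# A single element: row index 1, column index 2.
print(my_array[1, 2])      # 6

# Negative indices select from the end.
print(my_array[-1, :])     # last row: [-1  3 -4  5  3]
print(my_array[:, -1])     # last column: [5 3 3]

# A slice of 2 rows and 2 columns is itself a 2 x 2 array.
print(my_array[0:2, 1:3].shape)   # (2, 2)
```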


## Pandas

Pandas is a powerful and flexible open source data analysis and manipulation tool, providing easy and efficient computations on large amounts of data.

The Pandas `DataFrame` data structure is probably the tool we will use the most from Pandas. It usually stores the whole dataset, and we can then apply different data analysis methods to that `DataFrame`.

**Initializing a DataFrame:** To initialize a `DataFrame`, pass a Python dictionary mapping **strings to Python lists** to the `pandas.DataFrame()` constructor:

In [16]:
import pandas as pd

my_df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 3, 4], 'z': [9, 8, 7]})
print('The whole dataset:')
print(my_df)

print('Accessing specific columns:')
print(my_df['x'])
print(my_df['z'])

The whole dataset:
   x  y  z
0  1  2  9
1  2  3  8
2  3  4  7
Accessing specific columns:
0    1
1    2
2    3
Name: x, dtype: int64
0    9
1    8
2    7
Name: z, dtype: int64
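Selecting one column, as above, gives a Pandas `Series`. To keep several columns at once, pass a list of column names, which gives back a smaller `DataFrame`. A minimal sketch with the same data:

```python
import pandas as pd

my_df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 3, 4], 'z': [9, 8, 7]})

single = my_df['x']           # one column name -> Series
subset = my_df[['x', 'z']]    # list of column names -> DataFrame

print(type(single).__name__)  # Series
print(type(subset).__name__)  # DataFrame
print(subset.shape)           # (3, 2)
```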


Most of the time, we will read the data from a .csv file into a `DataFrame`. To do that, we simply call:
```
my_df = pd.read_csv('[path to the .csv file]')
```
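To try `read_csv()` without a file on disk, you can hand it any file-like object. The sketch below assumes some made-up CSV text and wraps it in `io.StringIO` from the standard library; with a real file you would pass the path instead:

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = "x,y,z\n1,2,9\n2,3,8\n3,4,7\n"

# The first line of the CSV becomes the column names.
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df)
print(csv_df.shape)           # (3, 3)
print(list(csv_df.columns))   # ['x', 'y', 'z']
```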

**Indexing and Slicing Rows:** Note that an extra column has been added on the left of our dataset. This is the _index_ column, which is used to distinguish rows from each other. As shown above, the indices are kept intact when we separate individual columns from the original dataset.

The indices are also used to access individual rows in the dataset. The idea is the same as indexing and slicing a list, but this time we need to use the `loc` indexer, and the end of a `loc` slice is included in the result:

In [17]:
print('Getting row at index 1:')
print(my_df.loc[1])

print('Getting the rows from index 0 to index 1 (inclusively):')
print(my_df.loc[0:1])

Getting row at index 1:
x    2
y    3
z    8
Name: 1, dtype: int64
Getting the rows from index 0 to index 1 (inclusively):
   x  y  z
0  1  2  9
1  2  3  8


**Re-indexing a DataFrame:** It is also possible to assign a specific column in the dataset to be the index column:

In [18]:
my_df = my_df.set_index(['x'])
print(my_df)

   y  z
x      
1  2  9
2  3  8
3  4  7
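Once `x` is the index, `loc` looks rows up by their `x` value rather than by position. The sketch below contrasts this with `iloc`, the position-based indexer, and uses `reset_index()` to turn the index back into an ordinary column (the name `restored` is only illustrative):

```python
import pandas as pd

my_df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 3, 4], 'z': [9, 8, 7]})
my_df = my_df.set_index('x')

# loc uses index labels: the row where x == 1 (y=2, z=9).
print(my_df.loc[1])

# iloc uses integer positions: the second row, where x == 2 (y=3, z=8).
print(my_df.iloc[1])

# reset_index() moves x back into a regular column.
restored = my_df.reset_index()
print(list(restored.columns))   # ['x', 'y', 'z']
```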


**Iterating through a DataFrame:** To loop through the data in a `DataFrame`, we usually go through individual rows or individual columns. For example:

In [19]:
# looping through the columns
for column in my_df.columns:
    print(column)
    print(my_df[column])

y
x
1    2
2    3
3    4
Name: y, dtype: int64
z
x
1    9
2    8
3    7
Name: z, dtype: int64


In [20]:
# looping through the indices (i.e. through the rows)
for index in my_df.index:
    print(my_df.loc[index])

y    2
z    9
Name: 1, dtype: int64
y    3
z    8
Name: 2, dtype: int64
y    4
z    7
Name: 3, dtype: int64
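Another way to go through the rows is the `iterrows()` method, which yields each index label together with its row. The sketch below rebuilds the same `DataFrame` and sums `y` and `z` per row; the dictionary `row_sums` is just an example of collecting results:

```python
import pandas as pd

my_df = pd.DataFrame({'y': [2, 3, 4], 'z': [9, 8, 7]},
                     index=pd.Index([1, 2, 3], name='x'))

# iterrows() yields (index label, row as a Series) pairs.
row_sums = {}
for index, row in my_df.iterrows():
    row_sums[int(index)] = int(row['y'] + row['z'])

print(row_sums)   # {1: 11, 2: 11, 3: 11}
```

For large datasets, column-wise operations are generally much faster than looping row by row.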


**Adding columns to a DataFrame:** Similar to adding an item to a dictionary, we could do:

In [21]:
print('Original DataFrame:')
print(my_df)

# adding a column with the name 't'
my_df['t'] = [20, 3, 19]
print('DataFrame after adding t:')
print(my_df)

Original DataFrame:
   y  z
x      
1  2  9
2  3  8
3  4  7
DataFrame after adding t:
   y  z   t
x          
1  2  9  20
2  3  8   3
3  4  7  19
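A new column can also be computed from existing ones, since arithmetic on columns is applied row by row. A minimal sketch, where the column name `y_plus_z` is made up for the example:

```python
import pandas as pd

my_df = pd.DataFrame({'y': [2, 3, 4], 'z': [9, 8, 7]},
                     index=pd.Index([1, 2, 3], name='x'))

# Each row of the new column is y + z for that row.
my_df['y_plus_z'] = my_df['y'] + my_df['z']
print(my_df)
print(my_df['y_plus_z'].tolist())   # [11, 11, 11]
```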


**Saving a DataFrame to file:** Simply call `my_df.to_csv('[path to file]')` to save `my_df` to a .csv file on your local machine. You don't need to create the .csv file before calling `to_csv()`: if the file doesn't exist, Pandas will create it for you. If a file with the same name already exists, it will be overwritten.
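As a sketch of a full round trip, the code below writes a `DataFrame` to a temporary directory and reads it back. The file name `my_df.csv` is an arbitrary choice, and `index_col='x'` tells `read_csv()` to use the saved `x` column as the index again:

```python
import os
import tempfile

import pandas as pd

my_df = pd.DataFrame({'y': [2, 3, 4], 'z': [9, 8, 7]},
                     index=pd.Index([1, 2, 3], name='x'))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_df.csv')  # arbitrary file name
    my_df.to_csv(path)                         # the index is written as the first column
    reloaded = pd.read_csv(path, index_col='x')

print(reloaded)
```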

**Dealing with missing values:** It is quite common for real-world datasets to contain missing values, which usually result from data being collected incorrectly, or from a field not being applicable to some entries in the dataset. Pandas denotes missing values as `NaN` (Not a Number). For example:

In [59]:
my_df = pd.DataFrame({'x': [None, 1, 3, 3], 'y': [4, 2, 9, 8], 'z': [3, 4, None, 5]})
print(my_df)

     x  y    z
0  NaN  4  3.0
1  1.0  2  4.0
2  3.0  9  NaN
3  3.0  8  5.0
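Before choosing how to handle missing values, it helps to know how many there are and where. One way to check, sketched here with the same data, is `isna()`, which marks each missing cell as `True`, followed by `sum()` to count them per column:

```python
import pandas as pd

my_df = pd.DataFrame({'x': [None, 1, 3, 3], 'y': [4, 2, 9, 8], 'z': [3, 4, None, 5]})

# True wherever a value is missing.
print(my_df.isna())

# Summing the booleans counts the missing values in each column.
missing_per_column = my_df.isna().sum()
print(missing_per_column)

# Summing again gives the total for the whole DataFrame.
total_missing = int(missing_per_column.sum())
print(total_missing)   # 2
```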


It is usually not desirable to leave `NaN` values as they are in our dataset. Several strategies for handling missing values exist, and the right one depends on the type of data that we have. To name a few:
- Dropping rows or columns with `NaN` values:

In [60]:
# dropping the rows with NaN values
filled_df = my_df.dropna(axis=0)
print(filled_df)

     x  y    z
1  1.0  2  4.0
3  3.0  8  5.0


In [61]:
# dropping the columns with NaN values
filled_df = my_df.dropna(axis=1)
print(filled_df)

   y
0  4
1  2
2  9
3  8


Dropping can be useful at times, but as shown above, if our data contains many missing values, dropping every row or column that contains `NaN` can discard most of the dataset. Other methods attempt to find meaningful ways to fill in the missing data.
- Using the `fillna()` function:

In [62]:
# filling the missing values with the mean, mode, and median of corresponding columns
filled_df = my_df.fillna(my_df.mean())
print(filled_df)
filled_df = my_df.fillna(my_df.mode().loc[0])
print(filled_df)
filled_df = my_df.fillna(my_df.median())
print(filled_df)

          x  y    z
0  2.333333  4  3.0
1  1.000000  2  4.0
2  3.000000  9  4.0
3  3.000000  8  5.0
     x  y    z
0  3.0  4  3.0
1  1.0  2  4.0
2  3.0  9  3.0
3  3.0  8  5.0
     x  y    z
0  3.0  4  3.0
1  1.0  2  4.0
2  3.0  9  4.0
3  3.0  8  5.0
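`fillna()` also accepts a dictionary that maps column names to fill values, which is handy when different columns call for different strategies. The choice below (zero for `x`, the column mean for `z`) is only an illustration, not a recommendation:

```python
import pandas as pd

my_df = pd.DataFrame({'x': [None, 1, 3, 3], 'y': [4, 2, 9, 8], 'z': [3, 4, None, 5]})

# Fill x with a constant and z with its own mean (3 + 4 + 5) / 3 = 4.0.
filled_df = my_df.fillna({'x': 0, 'z': my_df['z'].mean()})
print(filled_df)
```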


- Using the `interpolate()` function:

In [63]:
filled_df = my_df.interpolate()
print(filled_df)

     x  y    z
0  NaN  4  3.0
1  1.0  2  4.0
2  3.0  9  4.5
3  3.0  8  5.0


By default, the `interpolate()` function fills in missing values by linear interpolation down each column, so a single missing value is replaced by the midpoint of the values above and below it. In our example, the top left cell is left as is, as there is no value above it, but the second `NaN` cell is filled with `4.5`, which is the midpoint of the value above (`4.0`) and the value below (`5.0`).
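When several missing values sit next to each other, the default linear interpolation spaces them evenly between the known values on either side. A small sketch with two made-up `Series`, the second showing that a leading `NaN` stays unfilled:

```python
import math

import pandas as pd

# Two consecutive gaps between 1.0 and 4.0 are filled in equal steps.
gaps = pd.Series([1.0, None, None, 4.0])
filled_gaps = gaps.interpolate()
print(filled_gaps.tolist())      # [1.0, 2.0, 3.0, 4.0]

# A missing value at the start has nothing before it, so it is left as NaN.
leading = pd.Series([None, 1.0, 3.0])
filled_leading = leading.interpolate()
print(filled_leading.tolist())   # [nan, 1.0, 3.0]
```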