# Python Tutorial for Data Science

#### What is Data Science?
`Data science` draws on three distinct but overlapping areas: the skills of a `statistician`, who knows how to model and summarize datasets (which are growing ever larger); the skills of a `computer scientist`, who can design and use algorithms to efficiently store, process, and visualize this data; and `domain expertise`, what we might think of as “classical” training in a subject, which is necessary both to formulate the right questions and to put their answers in context.
<br>
![image-2.png](attachment:image-2.png)

Tutorial Outline
* Introduction to NumPy
* Data Manipulation with Pandas
* Visualization with Matplotlib
* Machine Learning

# 1. Introduction to NumPy


**NumPy** (short for Numerical Python) is a library that provides the `ndarray` object, an efficient interface for storing and operating on dense data buffers. <br>
NumPy arrays resemble Python’s built-in list type, but they offer much more efficient storage and data operations as the arrays grow larger in size.
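
As a rough illustration of the storage claim, the sketch below compares the approximate memory footprint of a Python list with that of a NumPy array holding the same integers. The size `n = 1000` and all variable names are chosen here for illustration only, and the list figure is an estimate that varies between Python versions.

```python
import sys
import numpy as np

n = 1000  # illustrative size, not taken from the tutorial
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores references to separate Python int objects;
# the array stores the raw 8-byte values in one contiguous buffer.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
array_bytes = np_arr.nbytes  # 1000 elements * 8 bytes each

print("list  (approx.):", list_bytes, "bytes")
print("array (data)   :", array_bytes, "bytes")

# Element-wise arithmetic needs no explicit Python loop
doubled = np_arr * 2
print(doubled[:5])
```

The array figure counts only the data buffer; the array object itself adds a small fixed overhead on top of that.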

In [2]:
#import numpy library
import numpy as np
np.__version__

'1.20.1'

#### Creating Arrays from Python Lists
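Unlike Python lists, NumPy arrays hold elements of a single type. If the types in the input list do not match, NumPy upcasts them where possible (in the second example below, integers are upcast to floating point).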

In [10]:
# integer array
np.array([1, 4, 2, 5, 3])

array([1, 4, 2, 5, 3])

In [11]:
np.array([3.14, 4, 2, 3])

array([3.14, 4.  , 2.  , 3.  ])
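
The type inference in the two cells above can be checked directly through the `dtype` attribute. This is a minimal sketch; the variable names `ints` and `mixed` are introduced here for illustration.

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])
mixed = np.array([3.14, 4, 2, 3])

# All-integer input gives a signed integer dtype
# (the exact width, e.g. int64, depends on the platform)
print(ints.dtype.kind)   # 'i'

# One float in the input upcasts every element to floating point
print(mixed.dtype)       # float64

# Converting back must be requested explicitly; astype(int) drops the fractional part
print(mixed.astype(int))
```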

If we want to explicitly set the data type of the resulting array, we can use the dtype
keyword:

In [12]:
np.array([1, 2, 3, 4], dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

Unlike Python lists, NumPy arrays can be explicitly multidimensional:

In [13]:
# nested lists result in multidimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

#### Creating Arrays from Scratch

In [14]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [3]:
# Create a 3x5 floating-point array filled with 1s
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [4]:
# Create a 3x5 array filled with 3.14
np.full((3,5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [5]:
# Create an array filled with a linear sequence starting at 0, stepping by 2, and stopping before 20
# (like the built-in range() function, the stop value is excluded)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [6]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
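
One difference between the last two functions is easy to miss: `np.arange` excludes its stop value, while `np.linspace` includes it by default. The sketch below contrasts the two; the variable names are illustrative.

```python
import numpy as np

a = np.arange(0, 20, 2)                      # stop value 20 is excluded
b = np.linspace(0, 20, 11)                   # stop value 20 is included
c = np.linspace(0, 20, 10, endpoint=False)   # endpoint=False excludes the stop value

print(a[-1])  # 18
print(b[-1])  # 20.0
print(c[-1])  # 18.0
```

`np.arange` is usually the natural choice when the step size matters, and `np.linspace` when the number of points matters.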

### The Basics of NumPy Arrays
This section covers the following basic array manipulations:
* **Attributes of arrays:** Determining the size, shape, memory consumption, and data types of arrays
* **Indexing of arrays:** Getting and setting the value of individual array elements
* **Slicing of arrays:** Getting and setting smaller subarrays within a larger array
* **Reshaping of arrays:** Changing the shape of a given array
* **Joining and splitting of arrays:** Combining multiple arrays into one, and splitting one array into many
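
Reshaping and joining from the list above can be sketched with `reshape` and `np.concatenate`. This is a minimal example with made-up arrays, not the tutorial's own data.

```python
import numpy as np

# Reshaping: the new shape must hold the same number of elements (here 9)
grid = np.arange(1, 10).reshape((3, 3))

# Joining: np.concatenate takes a sequence of arrays
a = np.array([1, 2, 3])
b = np.array([3, 2, 1])
joined = np.concatenate([a, b])

# For two-dimensional arrays, axis selects the direction of the join
stacked_rows = np.concatenate([grid, grid])          # axis=0 by default, shape (6, 3)
stacked_cols = np.concatenate([grid, grid], axis=1)  # shape (3, 6)

print(grid)
print(joined)
print(stacked_rows.shape, stacked_cols.shape)
```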

In [11]:
# Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the total size of the array)
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60


In [9]:
#print the data type 
print("dtype:", x3.dtype)

dtype: int64


In [10]:
#Itemsize, which lists the size (in bytes) of each array element, and nbytes, which lists the total size (in bytes) of the array
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

itemsize: 8 bytes
nbytes: 480 bytes
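
The three attributes are related: `nbytes` equals `itemsize` times `size`. The sketch below checks this on an array of the same shape as `x3`. The dtype is set explicitly as an assumption, so that the numbers do not depend on the platform's default integer type.

```python
import numpy as np

# Same shape as x3, with an explicit 8-byte integer type
arr = np.zeros((3, 4, 5), dtype=np.int64)
print(arr.itemsize * arr.size == arr.nbytes)  # True

# A smaller element type shrinks the total proportionally
small = arr.astype(np.int8)
print(arr.nbytes, small.nbytes)  # 480 60
```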


#### Array Indexing: Accessing Single Elements
In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired index in square brackets:

In [12]:
x1

array([5, 0, 3, 3, 7, 9])

In [13]:
x1[0]

5

In [14]:
x1[4]

7

In [15]:
#To index from the end of the array, you can use negative indices
x1[-1]

9

In [16]:
x1[-2]

7
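
The same notation extends to multidimensional arrays, using a comma-separated tuple of indices, and it also works for assignment. The following is a small sketch on arrays defined here for illustration (they copy the values of `x2` and part of `x1` so the block runs on its own).

```python
import numpy as np

grid = np.array([[3, 5, 2, 4],
                 [7, 6, 8, 8],
                 [1, 6, 7, 7]])

first = grid[0, 0]    # row 0, column 0
last = grid[2, -1]    # last column of row 2
print(first, last)    # 3 7

# Assignment uses the same index notation
grid[0, 0] = 12
print(grid[0])        # [12  5  2  4]

# Arrays have a fixed type: a float stored in an integer array is silently truncated
vec = np.array([5, 0, 3])
vec[0] = 3.14159
print(vec)            # [3 0 3]
```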

#### Array Slicing: Accessing Subarrays
Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon (:) character. The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:<br>
`x[start:stop:step]`

In [3]:
x = np.arange(10)

In [4]:
# accessing the first five elements
x[:5] 

array([0, 1, 2, 3, 4])

In [5]:
# accessing the elements from index 5 onward
x[5:] 

array([5, 6, 7, 8, 9])

In [6]:
# accessing the middle subarray
x[4:7] 

array([4, 5, 6])

In [7]:
# every other element
x[::2] 

array([0, 2, 4, 6, 8])

In [8]:
# every other element, starting at index 1
x[1::2] 

array([1, 3, 5, 7, 9])

In [9]:
# all elements, reversed
x[::-1] 

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [10]:
# every other element in reverse order, starting from index 5
x[5::-2] 

array([5, 3, 1])

#### Multidimensional subarrays
Multidimensional slices work in the same way, with multiple slices separated by commas.

In [12]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [13]:
# accessing two rows, three columns 
x2[:2, :3] 

array([[3, 5, 2],
       [7, 6, 8]])

In [14]:
# accessing all rows, every other column
x2[:3, ::2] 

array([[3, 2],
       [7, 8],
       [1, 7]])

In [15]:
# all rows and columns, reversed
x2[::-1, ::-1]

array([[7, 7, 6, 1],
       [8, 8, 6, 7],
       [4, 2, 5, 3]])

In [16]:
#Accessing array rows and columns.
print(x2[:, 0]) # first column of x2

[3 7 1]


In [17]:
# first row of x2
print(x2[0, :]) 

[3 5 2 4]


In [18]:
# equivalent to x2[0, :]
print(x2[0]) 

[3 5 2 4]
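
One property of array slices worth knowing is that they return views rather than copies of the data, so modifying a slice also modifies the original array. This differs from Python lists, where slices are copies. A minimal sketch, using an array defined here for illustration:

```python
import numpy as np

grid = np.arange(12).reshape((3, 4))

# A slice is a view onto the same memory
sub_view = grid[:2, :2]
sub_view[0, 0] = 99
print(grid[0, 0])   # 99: the original array changed

# An explicit copy is independent of the original
sub_copy = grid[:2, :2].copy()
sub_copy[0, 0] = -1
print(grid[0, 0])   # still 99
```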


### Splitting of arrays
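Splitting breaks one array into multiple sub-arrays. It is implemented by the functions `np.split`, `np.hsplit`, and `np.vsplit`. Each accepts a list of indices giving the split points; N split points produce N + 1 sub-arrays, which need not be of equal size.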

In [19]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

[1 2 3] [99 99] [3 2 1]
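
The second argument of `np.split` may also be an integer, in which case the array is divided into that many equal pieces. The sketch below shows both forms, plus `np.array_split`, which tolerates an uneven division. The variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

# A list of indices: two split points give three pieces of unequal length
parts = np.split(x, [3, 5])
print([len(p) for p in parts])    # [3, 2, 3]

# An integer: that many equal pieces (raises ValueError if the length does not divide evenly)
halves = np.split(x, 2)
print([len(p) for p in halves])   # [4, 4]

# np.array_split allows pieces of unequal length
uneven = np.array_split(x, 3)
print([len(p) for p in uneven])   # [3, 3, 2]
```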


In [20]:
grid = np.arange(16).reshape((4, 4))
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

**vsplit** splits an array vertically (row-wise). It is equivalent to `split` with axis=0 (the default): the array is always split along the first axis, regardless of the number of dimensions.

In [21]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]


**The hsplit()** function splits an array into multiple sub-arrays horizontally (column-wise).
It is equivalent to `split` with axis=1: the array is split along the second axis, except for one-dimensional arrays, which are split along axis 0.

In [22]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]
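
The relationship between the three splitting functions can be checked with a short sketch. It rebuilds the same 4x4 `grid` so that the block runs on its own; the names ending in `2` are introduced here for comparison only.

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

# hsplit matches split along axis 1 (columns)
left, right = np.hsplit(grid, [2])
left2, right2 = np.split(grid, [2], axis=1)
print(np.array_equal(left, left2), np.array_equal(right, right2))    # True True

# vsplit matches split along axis 0 (rows)
upper, lower = np.vsplit(grid, [2])
upper2, lower2 = np.split(grid, [2], axis=0)
print(np.array_equal(upper, upper2), np.array_equal(lower, lower2))  # True True

# A one-dimensional array has no second axis, so hsplit works along axis 0
head, tail = np.hsplit(np.arange(6), [2])
print(head, tail)   # [0 1] [2 3 4 5]
```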


# 2. Data Manipulation with Pandas
Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
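
To make the description concrete, here is a minimal sketch of a DataFrame with row labels, column labels, mixed column types, and one missing value. The column names and values are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each column has its own type
df = pd.DataFrame(
    {
        "name": ["sample_a", "sample_b", "sample_c"],
        "score": [0.25, np.nan, 0.75],    # np.nan marks a missing value
        "passed": [True, False, True],
    },
    index=["a", "b", "c"],                # row labels
)

print(df)
print(df.dtypes)                   # one dtype per column
print(df.loc["b", "name"])         # label-based access: sample_b
print(df["score"].isna().sum())    # number of missing scores: 1
```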

In [6]:
#Import pandas
import pandas as pd
pd.__version__

'1.2.4'

#### The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data.

In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
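
The output above shows two parts: a column of index labels and a column of values. Both are available as attributes, and the index does not have to be integers. The sketch below assumes a string index chosen for illustration; the `labeled` name is hypothetical.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

print(data.values)   # the underlying NumPy array
print(data.index)    # the default integer index
print(data[1])       # 0.5

# An explicit index attaches a label to each value
labeled = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
print(labeled["b"])  # 0.5

# Label-based slices include the end label
print(labeled["a":"c"])
```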