### Essential Python Libraries
- **NumPy**, short for Numerical Python, provides the data structures and algorithms needed for most scientific applications involving numerical data in Python.
    - NumPy contains: 
        - A fast and efficient multidimensional array object *ndarray*
        - Functions for performing element-wise computations with arrays or mathematical operations between arrays. 
        - Tools for reading and writing array-based datasets to disk
        - Linear algebra operations
    - Beyond the fast array-processing capabilities that NumPy adds to Python, one of its primary uses in data analysis is as a container for data to be passed between algorithms and libraries.
        - For numerical data, NumPy arrays are more efficient for storing and manipulating data than the other built-in Python data structures. Thus, many numerical computing tools for Python either assume NumPy arrays as a primary data structure or target seamless interoperability with NumPy.
       
- **pandas** provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive. The primary objects in pandas are the `DataFrame`, a tabular, column-oriented data structure with both row and column labels, and the `Series`, a one-dimensional labeled array object.
    - pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases.
        - It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data.
    - Note, as a result of having been built initially to solve finance and business analytics problems, pandas features especially deep time series functionality and tools well suited for working with time-indexed data generated by business processes. 
    
- **matplotlib** is the most popular Python library for producing plots and other two-dimensional data visualizations. 

- **SciPy** is a collection of packages addressing a number of different standard problem domains in scientific computing. 
    - Here are some of the packages included:
        - `scipy.integrate` includes numerical integration routines and differential equation solvers.
        - `scipy.linalg` includes linear algebra routines and matrix decompositions extending beyond those provided in `numpy.linalg`.
        - `scipy.optimize` includes function optimizers (minimizers) and root-finding algorithms.
        - `scipy.signal` includes signal processing tools.
        - `scipy.sparse` includes sparse matrices and sparse linear system solvers. 
        - `scipy.stats` includes standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics.
- **scikit-learn** is the premier general-purpose machine learning toolkit in Python. 
    - It includes submodules for models such as: 
        - Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
        - Regression: Lasso, ridge regression, etc.
        - Clustering: k-means, spectral clustering, etc.
        - Dimensionality reduction: PCA, feature selection, matrix factorization, etc.
        - Model selection: Grid search, cross-validation, metrics
        - Preprocessing: Feature extraction, normalization
- **statsmodels** is a statistical analysis package that contains algorithms for classical statistics and econometrics. 
    - It includes submodules such as: 
        - Regression models: Linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.
        - Analysis of variance (ANOVA)
        - Time series analysis: AR, ARMA, ARIMA, VAR, and other models
        - Nonparametric methods: Kernel density estimation, kernel regression
        - Visualization of statistical model results
    - statsmodels is more focused on statistical inference, providing uncertainty estimates and p-values for parameters. scikit-learn, by contrast, is more prediction-focused.
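
As a rough illustration of the two pandas objects described in the list above, the sketch below builds a `Series` and a `DataFrame` and runs one aggregation. The labels, city names, and numbers are invented for this example and do not come from the notes.

```python
import pandas as pd

# A Series is a one-dimensional labeled array (hypothetical values).
prices = pd.Series([10.5, 11.0, 9.75], index=["a", "b", "c"])

# A DataFrame is a table with both row and column labels.
df = pd.DataFrame(
    {"city": ["Lima", "Oslo", "Lima"], "sales": [100, 250, 175]},
    index=["r1", "r2", "r3"],
)

# Label-based selection on the Series.
price_b = prices["b"]

# Group-wise aggregation on the DataFrame: total sales per city.
sales_by_city = df.groupby("city")["sales"].sum()

print(price_b)
print(sales_by_city)
```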
    
Data science tasks generally fall into a number of different broad groups:
- *Interacting with the outside world*
    - Reading and writing with a variety of file formats and data stores
- *Preparation*
    - Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis. 
- *Transformation*
    - Applying mathematical and statistical operations to groups of datasets to derive new datasets. 
- *Modeling and computation*
    - Connecting your data to statistical models, machine learning algorithms, or other computational tools. 
- *Presentation*
    - Creating interactive or static graphical visualizations or textual summaries. 
    
**Import Conventions**
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
```
Note: it's considered bad practice in Python software development to import everything from a large package. When you can, import only what you need.
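
A small sketch of the difference, using the standard library's `math` module as a stand-in for a large package (the variable names are only for illustration):

```python
# Discouraged: a wildcard import pulls every public name into the namespace,
# which makes it hard to tell where a name came from.
# from math import *

# Preferred: import only the names you need ...
from math import pi, sqrt

# ... or import the module under a short alias and qualify each use.
import math as m

hypotenuse = sqrt(3**2 + 4**2)
circle_area = m.pi * 2**2

print(hypotenuse, circle_area == pi * 4)
```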

**Jargon**
- *Munge/munging/wrangling*
    - Describes the process of manipulating unstructured and/or messy data into a structured or clean form.
- *Pseudocode*
    - A description of an algorithm or process that takes a code-like form while not being actual valid code.
- *Syntactic sugar*
    - Programming syntax that does not add new features, but makes something more convenient or easier to type.
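
One common example of syntactic sugar in Python is the list comprehension, which could always be written as an explicit loop. The sketch below uses a made-up list of strings to show that both forms give the same result.

```python
words = ["  rico ", "pobre  ", " mundo"]

# Explicit loop: build the result one element at a time.
stripped_loop = []
for w in words:
    stripped_loop.append(w.strip())

# List comprehension: same result, more compact syntax.
stripped_sugar = [w.strip() for w in words]

print(stripped_loop == stripped_sugar)
```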



**Opening text files**
To open a file for reading or writing, use the built-in `open` function with either a relative or absolute path. When you use `open` to create a file object, it is important to explicitly close the file when you are finished with it. Closing the file releases its resources back to the operating system.

One way to make it easier to clean up open files is to use the `with` statement, as in `with open(path) as f:`. This automatically closes the file `f` when exiting the `with` block.

Careful: if we had typed `f = open(path,'w')`, a *new file* would have been created, overwriting the current version of the file. 
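
The following sketch shows both points on a throwaway file: the `with` block closes the file on exit, and reopening with `'w'` discards the earlier contents. The temporary directory and the file name `example.txt` are assumptions made so the example does not touch any real file.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt")  # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        f.write("first version\n")
    closed_after_with = f.closed  # the with block closed f on exit

    # Opening with 'w' again truncates the file before writing.
    with open(path, "w", encoding="utf-8") as f:
        f.write("second version\n")

    with open(path, encoding="utf-8") as f:
        contents = f.read()

print(closed_after_with, repr(contents))
```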

Python file modes

|**Mode**| **Description**|
|--------|----------------|
|`r` |Read-only mode|
|`w` |Write-only mode; creates a new file (erasing the data for any file with the same name)|
|`x` |Write-only mode; creates a new file, but fails if the file path already exists|
|`a` |Append to existing file (create the file if it does not already exist)|
|`r+` |Read and write|
|`b` |Add to mode for binary files (e.g., `'rb'` or `'wb'`)|
|`t` |Text mode for files (automatically decoding bytes to Unicode). This is the default if not specified. Add `t` to other modes to use this (e.g., `'rt'` or `'xt'`)|
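
A minimal sketch of how a few of these modes behave, again on a temporary file with a hypothetical name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "modes.txt")  # hypothetical file name

    # 'x' creates the file; it succeeds here because the path is new.
    with open(path, "x", encoding="utf-8") as f:
        f.write("line 1\n")

    # A second 'x' on the same path raises FileExistsError.
    try:
        open(path, "x")
        x_failed = False
    except FileExistsError:
        x_failed = True

    # 'a' appends to the end rather than truncating.
    with open(path, "a", encoding="utf-8") as f:
        f.write("line 2\n")

    # 'rb' returns raw bytes instead of decoded text.
    with open(path, "rb") as f:
        raw = f.read()

print(x_failed, raw)
```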

```python
pwd
```

```
'/Users/sarahamiraslani/Learning/Python/Data Science'
```

```python
path = '/Users/sarahamiraslani/Learning/data/segismundo.txt'
with open(path) as f:
    lines = [x.rstrip() for x in f]

lines
```

```
['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']
```

```python
with open(path) as f:
    lines = f.readlines()

f.closed
```

```
True
```

```python
lines
```

```
['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 '\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 '\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 '\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n',
 '\n']
```

For readable files, some of the most commonly used methods are `read`, `seek`, and `tell`. 
- `read` returns a certain number of characters from the file. 
- `tell` gives you the current position.
- `seek` changes the file position. 
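
These three methods can be tried together on a small file. The sketch below writes ten ASCII characters to a temporary file (the file name is hypothetical) and then reads part of it, checks the position, and rewinds.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("abcdefghij")

    with open(path, encoding="utf-8") as f:
        first_four = f.read(4)   # read 4 characters
        position = f.tell()      # current position after the read
        f.seek(0)                # jump back to the start of the file
        again = f.read(2)

print(first_four, position, again)
```

With non-ASCII text the position reported by `tell` can differ from the number of characters read, because some characters take more than one byte once encoded.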

Important Python file methods or attributes

|**Method**|**Description**|
|----------|---------------|
|`read()` |Return data from file as a string, with optional size argument indicating the number of characters (bytes in binary mode) to read|
|`readlines()`| Return list of lines in the file, with optional size argument|
|`write()` | Write passed string to file|
|`writelines()`| Write passed sequence of strings to the file|
|`close()` | Close the handle |
|`flush()` | Flush the internal I/O buffer to disk|
|`seek()` | Move to indicated file position (integer)|
|`tell()` | Return current file position as integer|
|`closed` |True if the file is closed|
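
A short sketch exercising several of the methods in the table without a `with` block, so that `flush`, `close`, and `closed` are visible. The temporary file and its contents are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "poem.txt")  # hypothetical file name

    f = open(path, "w", encoding="utf-8")
    # writelines does not add newlines, so each string carries its own.
    f.writelines(["uno\n", "\n", "dos\n"])
    f.flush()   # push buffered data out to the operating system
    f.close()
    was_closed = f.closed

    with open(path, encoding="utf-8") as f:
        all_lines = f.readlines()

# Strip trailing newlines and drop blank lines.
non_blank = [line.rstrip() for line in all_lines if line.strip()]

print(was_closed, all_lines, non_blank)
```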

## NumPy Basics

For data analysis applications, the main areas of functionality include:
- Fast vectorized array operations for data cleaning, subsetting, filtering, transforming, and any other kinds of computations. 
- Common array algorithms like sorting, unique, and set operations. 
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining together heterogeneous datasets.
- Expressing conditional logic as array expressions instead of loops with `if-elif-else` branches. 
- Group-wise data manipulations (aggregation, transformation, function application)
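
As one illustration of the conditional-logic point above, the sketch below labels each value as negative, zero, or positive, first with a loop and then with nested `np.where` calls. The input values are arbitrary.

```python
import numpy as np

values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# Loop version with if-elif-else branches.
labels_loop = []
for v in values:
    if v < 0:
        labels_loop.append(-1)
    elif v == 0:
        labels_loop.append(0)
    else:
        labels_loop.append(1)

# Array expression: nested np.where applies the same logic element-wise.
labels_vec = np.where(values < 0, -1, np.where(values == 0, 0, 1))

print(labels_vec)
```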

NumPy is important for numerical computations in Python largely because it is designed for efficiency on large arrays of data. There are a number of reasons for this:
- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python `for` loops. 
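
A rough sketch of both points, assuming an array of 100,000 integers. The memory figure for the list is only an approximation (the list's own size plus the size of each integer object it references), so treat the comparison as indicative rather than exact.

```python
import sys

import numpy as np

n = 100_000
py_list = list(range(n))
np_arr = np.arange(n)

# Vectorized: one expression, no Python-level loop.
doubled_arr = np_arr * 2

# Pure Python equivalent, one element at a time.
doubled_list = [x * 2 for x in py_list]

# Approximate memory use: the array's data buffer versus the list's
# pointer table plus the int objects it points to.
array_bytes = np_arr.nbytes
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

print(array_bytes, list_bytes)
```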

**The NumPy ndarray: A Multidimensional Array Object**
Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements. 

```python
import numpy as np

data = np.random.randn(2,3)
print(data)

print('\n')
print(data * 10)
```

```
[[-0.66574468 -0.9375177   0.87080909]
 [-0.42403869  0.25635012 -0.67678059]]


[[-6.6574468  -9.37517698  8.70809091]
 [-4.24038693  2.56350125 -6.76780587]]
```


An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. 
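
A small sketch of what homogeneity means in practice: when integers and a float are mixed, NumPy picks a single dtype that can hold all of them. The `ndim` and `shape` attributes used here describe the array's dimensions.

```python
import numpy as np

mixed = np.array([1, 2.5, 3])            # ints and a float in one list
grid = np.array([[1, 2, 3], [4, 5, 6]])  # a nested list of equal-length rows

print(mixed.dtype)             # a floating-point dtype: the ints were upcast
print(grid.ndim, grid.shape)   # number of dimensions and size along each
```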

**Creating ndarrays**
The easiest way to create an array is to use the `array` function. This accepts any sequence-like object and produces a new NumPy array containing the passed data. Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array.

```python
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1
```

```
array([6. , 7.5, 8. , 0. , 1. ])
```

```python
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2
```

```
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
```

Unless explicitly specified, `np.array` tries to infer a good data type for the array it creates. The data type is stored in a special `dtype` metadata object.
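
The inferred type can be inspected through the `dtype` attribute, and passing a `dtype` argument overrides the inference. The integer result in the sketch below depends on the platform, so the comment hedges on it.

```python
import numpy as np

inferred_float = np.array([6, 7.5, 8, 0, 1])
inferred_int = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
explicit = np.array([1, 2, 3], dtype=np.float32)

print(inferred_float.dtype)  # float64
print(inferred_int.dtype)    # a platform-dependent integer dtype, often int64
print(explicit.dtype)        # float32
```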

Array creation functions

|**Function** | **Description**|
|-------------|----------------|
|`array` |Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default|
|`asarray` |Convert input to ndarray, but do not copy if the input is already an ndarray|
|`arange` |Like the built-in `range` but returns an ndarray instead of a list|
|`ones`,`ones_like` | Produce an array of all 1s with the given shape and dtype; ones_like takes another array and produces a ones array of the same shape and dtype|
|`zeros`, `zeros_like` | Like ones and ones_like but producing arrays of 0s instead|
|`empty`,`empty_like` | Create new arrays by allocating new memory, but do not populate with any values, unlike `ones` and `zeros`|
|`full`, `full_like` |Produce an array of the given shape and dtype with all values set to the indicated "fill value"; `full_like` takes another array and produces a filled array of the same shape and dtype|
|`eye`, `identity`| Create a square N × N identity matrix (1s on the diagonal and 0s elsewhere)|
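
A brief sketch of several of these functions, with shapes and fill values chosen arbitrarily. The last two lines check the copy behavior noted for `array` and `asarray`.

```python
import numpy as np

zeros = np.zeros((2, 3))
ones_like_zeros = np.ones_like(zeros)  # same shape and dtype as zeros
filled = np.full((2, 2), 7)
ident = np.eye(3)
steps = np.arange(5)

# asarray returns the very same object when the input is already an ndarray
# and no dtype change is requested; array makes a copy by default.
same = np.asarray(steps) is steps
copied = np.array(steps) is steps

print(same, copied)
```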

**Data Types for ndarrays**
dtypes are a source of NumPy's flexibility for interacting with data coming from other systems. 

It's often only necessary to care about the general *kind* of data you're dealing with, whether floating point, complex, integer, boolean, string, or general Python object. However, when you need more control over how data are stored in memory and on disk, especially with large datasets, it is good to know that you can control the storage type.

NumPy data types

|**Types** | **Type code** | **Description**|
|----------|---------------|----------------|
|`int8`, `uint8`| `i1`, `u1`| Signed and unsigned 8-bit (1 byte) integer types|
|`int16`, `uint16`| `i2`, `u2`| Signed and unsigned 16-bit integer types|
|`int32`, `uint32`| `i4`, `u4`| Signed and unsigned 32-bit integer types|
|`int64`, `uint64`| `i8`, `u8`| Signed and unsigned 64-bit integer types|
|`float16`| `f2`| Half-precision floating point|
|`float32`| `f4` or `f`| Standard single-precision floating point; compatible with C float|
|`float64`| `f8` or `d`|Standard double-precision floating point; compatible with C double and Python float object|
|`float128`| `f16` or `g`| Extended-precision floating point|
|`complex64`, `complex128`, `complex256`|`c8`, `c16`, `c32`|Complex numbers represented by two 32-, 64-, or 128-bit floats, respectively|
|`bool`| `?` |Boolean type storing `True` and `False` values|
|`object`| `O` |Python object type; a value can be any Python object|
|`string_`| `S` |Fixed-length ASCII string type (1 byte per character); for example, to create a string dtype with length 10, use `'S10'`. Named `bytes_` in NumPy 2.0 and later|
|`unicode_` |`U` |Fixed-length Unicode type (number of bytes platform specific); same specification semantics as `string_` (e.g., `'U10'`). Named `str_` in NumPy 2.0 and later|
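
The type codes in the table can be passed as strings to the `dtype` argument. The sketch below stores the same three values at two integer widths to show the effect on storage; `itemsize` is the number of bytes per element and `nbytes` the total for the array.

```python
import numpy as np

small = np.array([1, 2, 3], dtype="i1")    # 8-bit signed integers
wide = np.array([1, 2, 3], dtype="i8")     # 64-bit signed integers
single = np.array([1.5, 2.5], dtype="f4")  # single-precision floats

print(small.dtype, small.itemsize, small.nbytes)
print(wide.dtype, wide.itemsize, wide.nbytes)
print(single.dtype)
```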

You can explicitly convert or *cast* an array from one dtype to another using ndarray's `astype` method.

```python
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(arr.dtype)

print('\n')

float_arr = arr.astype(np.float64)
print(float_arr)
print(float_arr.dtype)
```

```
[1 2 3 4 5]
int64


[1. 2. 3. 4. 5.]
float64
```


Note that if you cast some floating-point numbers to an integer dtype, the decimal part will be truncated.

```python
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
print(arr)
print(arr.dtype)

print('\n')

int_arr = arr.astype(np.int64)
print(int_arr)
print(int_arr.dtype)
```

```
[ 3.7 -1.2 -2.6  0.5 12.9 10.1]
float64


[ 3 -1 -2  0 12 10]
int64
```
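
Two further uses of `astype`, sketched with made-up values: converting an array of numeric strings to floats, and reusing another array's `dtype` as the cast target.

```python
import numpy as np

# Strings that all represent numbers can be cast to a numeric dtype.
numeric_strings = np.array(["1.25", "-9.6", "42"])
as_floats = numeric_strings.astype(np.float64)

# Reuse another array's dtype as the cast target.
int_array = np.arange(3)
calibers = np.array([0.22, 0.27, 0.357])
ints_as_float = int_array.astype(calibers.dtype)

# By default astype returns a new array; int_array keeps its integer dtype.
print(as_floats, ints_as_float.dtype, int_array.dtype)
```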


**Basic Indexing and Slicing**
An important distinction from Python's built-in lists is that array slices are *views* of the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array. 

```python
arr = np.arange(10)
print(arr)

arr[5:8] = 12
print(arr)
```

```
[0 1 2 3 4 5 6 7 8 9]
[ 0  1  2  3  4 12 12 12  8  9]
```

```python
# create a slice of arr
arr_slice = arr[5:8]

# notice: when we change values in arr_slice, the mutations are reflected in the original
# array arr
arr_slice[1] = 12345
print(arr)
```

```
[    0     1     2     3     4    12 12345    12     8     9]
```


If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array -- for example, `arr[5:8].copy()`. 
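
The sketch below contrasts a view with an explicit copy, and includes a built-in list for comparison. `np.shares_memory` is used only to confirm which objects overlap with the original array.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]          # shares memory with arr
copy = arr[5:8].copy()   # independent data

copy[:] = -1             # does not touch arr
unchanged = arr[5:8].tolist()

view[:] = 99             # writes through to arr
changed = arr[5:8].tolist()

# Built-in lists behave differently: slicing a list makes a new list.
lst = list(range(10))
lst_slice = lst[5:8]
lst_slice[0] = 99

print(unchanged, changed, lst[5])
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))
```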

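```python
# This array is two-dimensional (2 x 4), so it is named arr2d;
# indexing with a single integer returns a whole row
arr2d = np.array([[1, 2, 3, 4], [4, 5, 6, 7]])

arr2d[0]
```

```
array([1, 2, 3, 4])
```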