# Introduction to Data Analysis with Python


<img src="https://www.python.org/static/img/python-logo.png" alt="Python logo" style="width: 200px; float: right;"/>
<br>
<br>
<br>
<img src="../assets/yogen-logo.png" alt="yogen" style="width: 200px; float: right;"/>

# Objectives

* Handle tabular data with `pandas`

# The Python scientific stack: SciPy

Python's main data libraries:

* NumPy: Base N-dimensional array package
* SciPy library: Fundamental library for scientific computing
* Matplotlib: Comprehensive 2D plotting
* IPython: Enhanced interactive console
* SymPy: Symbolic mathematics
* pandas: Data structures & analysis

## `matplotlib`

## `pandas`

### Getting started with pandas

### `pandas` data structures

### Series

The base pandas abstraction. You can think of it as the love child of a numpy array and a dictionary.

If we provide an index, pandas will use it. If not, it will automatically create one.
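A minimal sketch of both cases, using made-up values and labels:

```python
import pandas as pd

# No index given: pandas creates a default integer index 0..n-1.
s_default = pd.Series([4, 7, -5, 3])

# Explicit index: pandas uses the labels we pass in.
s_labeled = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print(s_default.index.tolist())   # [0, 1, 2, 3]
print(s_labeled.index.tolist())   # ['d', 'b', 'a', 'c']
print(s_labeled['a'])             # -5
```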

We can create Series from dictionaries:
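For instance, with an illustrative dictionary (the keys and values here are invented for the example), the keys become the index labels:

```python
import pandas as pd

# Illustrative data: dictionary keys become index labels, values become data.
counts = {'a': 1, 'b': 2, 'c': 3}
s_from_dict = pd.Series(counts)

print(s_from_dict.index.tolist())   # ['a', 'b', 'c']
print(s_from_dict['b'])             # 2
```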

And here is where the magic happens: numpy arrays only identify their contents by position. In contrast, pandas knows their "name" and will align them based on their indexes:
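A small sketch of the difference, using two made-up Series whose labels only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'a', 'd'])

# NumPy adds by position: 1+10, 2+20, 3+30.
by_position = s1.to_numpy() + s2.to_numpy()

# pandas adds by label; labels present in only one Series give NaN.
by_label = s1 + s2

print(by_position.tolist())   # [11, 22, 33]
print(by_label)               # a: 21.0, b: NaN, c: 13.0, d: NaN
```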

### DataFrame

This is the object you'll work with most of the time. It represents a table of _m_ observations x _n_ variables. Each variable, or column, is a Series.


```python
import pandas as pd

# Each key becomes a column; all lists must have the same length.
dfdata = {
    'province': ['M', 'M', 'M', 'B', 'B'],
    'population': [1.5e6, 2e6, 3e6, 5e5, 1.5e6],
    'year': [1900, 1950, 2000, 1900, 2000],
}

df = pd.DataFrame(dfdata)
```

### Index objects

Indexes are immutable.
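A quick sketch of what that means in practice: assigning to a single position of an index fails, while replacing the whole index is fine. The Series and labels are made up for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

raised = False
try:
    s.index[0] = 'z'      # item assignment on an Index is rejected
except TypeError as err:
    raised = True
    print('TypeError:', err)

# Replacing the whole index with a new object is allowed.
s.index = ['x', 'y', 'z']
print(s.index.tolist())   # ['x', 'y', 'z']
```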

### Dropping entries from an axis

By default, `drop()` doesn't modify the original Series; it returns a new one without the dropped entries. We can change that with the `inplace` argument.
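A minimal sketch of both behaviours on a made-up Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

dropped = s.drop('b')                 # returns a new Series
labels_after_copy = s.index.tolist()  # original is untouched
print(labels_after_copy)              # ['a', 'b', 'c']
print(dropped.index.tolist())         # ['a', 'c']

s.drop('b', inplace=True)             # modifies s itself and returns None
print(s.index.tolist())               # ['a', 'c']
```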

### Indexing, selection, and filtering

The key here is that we can build boolean Series and use them to index the original Series or DataFrame. Those boolean Series can be combined with the bitwise operators (`&`, `|`, `~`) to get filters that are as complex as we need.
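As a sketch, here are a few filters on the small DataFrame defined earlier (rebuilt here so the block runs on its own). The variable names are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'province': ['M', 'M', 'M', 'B', 'B'],
    'population': [1.5e6, 2e6, 3e6, 5e5, 1.5e6],
    'year': [1900, 1950, 2000, 1900, 2000],
})

in_m = df['province'] == 'M'      # boolean Series, one value per row
recent = df['year'] >= 1950

# Combine masks with & (and), | (or), ~ (not).
both = df[in_m & recent]

# Inline comparisons need parentheses: & and | bind tighter than == and >.
either = df[(df['province'] == 'B') | (df['population'] > 2e6)]
not_m = df[~in_m]

print(both.index.tolist())     # [1, 2]
print(either.index.tolist())   # [2, 3, 4]
print(not_m.index.tolist())    # [3, 4]
```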

### Function application and mapping

Function application and mapping allow us to transform a DataFrame (whole columns with `apply`, individual elements with `applymap`) without writing for loops. This way we are not constrained to the functions already implemented by pandas or numpy.

This is a typical use case for lambdas (anonymous functions).
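A small sketch with lambdas on a made-up DataFrame. It uses `apply` for column-wise work and `Series.map` for element-wise work on a single column; both are long-standing pandas methods.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# apply: the lambda receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# Series.map: the lambda receives one element at a time.
a_squared = df['a'].map(lambda x: x ** 2)

print(col_range.tolist())   # [2, 20]
print(a_squared.tolist())   # [1, 4, 9]
```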

### Sorting and ranking

`rank()` returns the position each element of the Series would occupy in its sorted version. If there are ties, each tied element gets the average of the positions they span.
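For example, in this made-up Series the two 7s would occupy positions 4 and 5 once sorted, so each gets rank 4.5. The `method` argument changes how ties are resolved.

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2])

avg_ranks = s.rank()               # default: ties share the average rank
min_ranks = s.rank(method='min')   # ties share the lowest rank instead

print(avg_ranks.tolist())   # [4.5, 1.0, 4.5, 3.0, 2.0]
print(min_ranks.tolist())   # [4.0, 1.0, 4.0, 3.0, 2.0]
```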

#### Exercise

Write a function that takes a Series and returns its top 20% of records (in this case, the top earners). Test it with this Series:

```python
salaries = pd.Series([10000, 43000, 150000, 90000, 120000, 30000, 10000, 5000, 40000, 50000, 80000, 35000, 27000, 14000, 28000, 22000, 25000])
```

## Summarizing and computing descriptive statistics

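Most of these methods work down the rows by default, giving one result per column. As with many pandas methods, we can pass `axis=1` to apply them in the perpendicular direction and get one result per row.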
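A minimal sketch with `sum()` on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_sums = df.sum()          # default axis=0: one result per column
row_sums = df.sum(axis=1)    # axis=1: one result per row

print(col_sums.tolist())     # [6, 60]
print(row_sums.tolist())     # [11, 22, 33]
```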

### Unique values, value counts, and membership

#### Exercise

Calculate the %GC of the following DNA sequence:

```python
dna = pd.Series(list('agtcgggaactttctctcgaggagacccaa'))
```

## Handling missing data

`NaN` is not equal to anything, not even to itself. This is weird... but there are some really good reasons for it. You can find explanations [here](https://stackoverflow.com/questions/10034149/why-is-nan-not-equal-to-nan) and [here](https://stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values).
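A short sketch of the consequence: comparing against `np.nan` never finds missing values, so we use `isnull()` instead. The Series is made up for the example.

```python
import numpy as np
import pandas as pd

nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)            # False

s = pd.Series([1.0, np.nan, 3.0])

by_comparison = (s == np.nan)       # never True, even for the missing value
by_isnull = s.isnull()              # the right way to detect missing values

print(by_comparison.tolist())       # [False, False, False]
print(by_isnull.tolist())           # [False, True, False]
```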

### Filtering out missing data

`any()` and `all()` are methods of boolean Series. They reduce the Series to a single boolean value by repeatedly applying the operators "or" and "and", respectively.
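A sketch of how they combine with `isnull()` to filter missing data, on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
})

mask = df.isnull()                    # True where a value is missing

a_has_missing = bool(mask['a'].any()) # at least one True in column 'a'
a_all_missing = bool(mask['a'].all()) # every value True in column 'a'

# axis=1 reduces across columns, giving one boolean per row.
rows_all_missing = mask.all(axis=1)
cleaned = df[~rows_all_missing]       # drop rows where everything is missing

print(a_has_missing, a_all_missing)   # True False
print(cleaned.index.tolist())         # [0, 2]
```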

The `thresh` argument of `dropna()` specifies the minimum number of non-null values required to keep a row (or a column, with `axis=1`).
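A small sketch of `thresh` along both axes of `dropna()`, using a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7.0, np.nan, 9.0],
})

# Default axis=0: keep rows that have at least 2 non-null values.
rows_kept = df.dropna(thresh=2)

# axis=1: keep columns that have at least 2 non-null values.
cols_kept = df.dropna(axis=1, thresh=2)

print(rows_kept.index.tolist())     # [0, 2]
print(cols_kept.columns.tolist())   # ['a', 'c']
```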

### Filling in missing data

# Loading and saving data

## Loading CSV

#### Exercise 

Calculate the number of routes to each destination country in the data.

Show all countries with more than 1000 routes.

#### Exercise

Extract the top 10 routes by passenger number. 

I only want to see origin, destination, and number of passengers.

## Saving to Excel

## Saving to CSV

## Saving to a SQL database

## To dictionary and to json

## Reading Excel

## Reading a MySQL database

# Additional References

[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)

[What is SciPy?](https://www.scipy.org/)

[How can SciPy be fast if it is written in an interpreted language like Python?](https://www.scipy.org/scipylib/faq.html#how-can-scipy-be-fast-if-it-is-written-in-an-interpreted-language-like-python)

[What is the difference between NumPy and SciPy?](https://www.scipy.org/scipylib/faq.html#what-is-the-difference-between-numpy-and-scipy)

[Linear Algebra for AI](https://github.com/fastai/fastai/blob/master/tutorials/linalg_pytorch.ipynb)