### **In the current notebook we are going to explain:**

*   What is Pandas and why do we want to start using it in data analytics?
*   Data structures in Python and in Pandas
    *   Performance in Pandas (sort, max, mean)
    *   When to use Pandas
    *   Data Types

In [0]:
import timeit
import pandas as pd
import numpy as np

## 1. Why Pandas?

**I have a copy shop that is not doing well. I need to allocate my resources to the days with the most customers, and learn more about the types of customers who spend more money on specific days.**

The questions would be:
*   Which days do I have more visits?
*   Which days do I have more income?
*   Is the number of visits always correlated to the revenue?

**Possible use case: "I find out that on Wednesdays the marketing department of company X has a team lunch and always loves to come around to buy fancy office material. This seems to be the main source of income. Should I specialize in offering these kinds of products?"**

In [0]:
# Lists
index_events = [0,1,2,3,4,5,6,7,8]
days_of_week = ['monday','thursday','saturday','sunday','monday','monday','wednesday','thursday','monday']
visits = [6,0,3,1,9,3,2,1,None]
income_euro = [43,7,23,11,0,34,324,55,45]

In [0]:
# Exercise 1. Answer the questions with plain Python, NumPy, or other Python packages.

# Which days do I have more visits?
# Which days do I have more income?
# Is the number of visits always correlated to the revenue?


**Pandas is well suited for many different kinds of data:**

*   Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
*   Ordered and unordered (not necessarily fixed-frequency) time series data.
*   Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
*   Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

**Here are just a few of the things that pandas does well:**
*   Easy handling of missing data (represented as NaN)
*   Size mutability: columns can be inserted and deleted from DataFrame
*   Automatic and explicit data alignment
*   Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
*   Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
*   Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
*   Intuitive merging and joining data sets
*   Flexible reshaping and pivoting of data sets
*   Hierarchical labeling of axes (possible to have multiple labels per tick)
*   Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
*   Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
*   Fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code
*   Dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
*   Has been used extensively in production in financial applications.
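
As a rough illustration of two items from the list above (missing data handling and split-apply-combine), the sketch below applies them to the copy shop lists. It is one possible approach, not the only one: the variable names `visits_s`, `income_s`, `visits_per_day` and `income_per_day` are made up for this example, and the lists are repeated so the cell runs on its own.

```python
import pandas as pd

# Copy shop data from the cell above, repeated so this cell is self-contained
days_of_week = ['monday', 'thursday', 'saturday', 'sunday', 'monday',
                'monday', 'wednesday', 'thursday', 'monday']
visits = [6, 0, 3, 1, 9, 3, 2, 1, None]
income_euro = [43, 7, 23, 11, 0, 34, 324, 55, 45]

visits_s = pd.Series(visits)        # None is stored as NaN
income_s = pd.Series(income_euro)

# Split-apply-combine: group the values by day label, then sum each group.
# sum() skips NaN by default, so the missing visit count does not break it.
visits_per_day = visits_s.groupby(days_of_week).sum()
income_per_day = income_s.groupby(days_of_week).sum()

print(visits_per_day.sort_values(ascending=False))
print(income_per_day.sort_values(ascending=False))

# Pearson correlation between visits and income; rows with NaN are left out
correlation = visits_s.corr(income_s)
print(correlation)
```

Grouping by a plain list of labels works here because the list has the same length as the Series. With a data frame you would usually group by a column name instead.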



## 2. Pandas Data Structures: Series & Data Frames

### Pandas: From list to series - Functions
[Pandas Series Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)

In [18]:
# Count unique values
# List
import collections
counter = collections.Counter(days_of_week)
print("List solution: {}".format(counter))

# Series
print("\nSeries solution:\n")
pd.Series(days_of_week).value_counts()

List solution: Counter({'monday': 4, 'thursday': 2, 'sunday': 1, 'wednesday': 1, 'saturday': 1})

Series solution:



monday       4
thursday     2
wednesday    1
saturday     1
sunday       1
dtype: int64

In [19]:
# Find gaps in the data
# List
print("List solution: {}".format([True if x is None else False for x in visits]))

# Series
print("\nSeries solution:\n")
pd.Series(visits).isnull()

List solution: [False, False, False, False, False, False, False, False, True]

Series solution:



0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
dtype: bool
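
The Series version can detect the gap because pandas stores the `None` entry as `NaN`. The sketch below shows what that means for the data type and two common ways of dealing with the gap. Whether filling with 0 is appropriate depends on the data; here it is only an assumption for illustration, and the names `filled` and `dropped` are made up for this example.

```python
import pandas as pd

visits = [6, 0, 3, 1, 9, 3, 2, 1, None]
visits_s = pd.Series(visits)

# NaN is a floating point value, so the whole Series becomes float64
print(visits_s.dtype)

# count() returns the number of non-missing values
n_valid = visits_s.count()
print(n_valid)

# Option 1: replace the gap with a default value (assumption: 0 visits)
filled = visits_s.fillna(0)

# Option 2: drop the rows with missing values
dropped = visits_s.dropna()

print(filled.tolist())
print(dropped.tolist())
```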

In [0]:
# Exercise 2. Calculate the maximum number of visits per day

In [0]:
# Exercise 3. Calculate the total number of visits registered

### Pandas: From list to series - Performance

In [0]:
# Build example data with NumPy
# Note: np.random.randint returns NumPy arrays, not Python lists
short_list = np.random.randint(100, size=100)
large_list = np.random.randint(100, size=1000000)
short_series = pd.Series(short_list)
large_series = pd.Series(large_list)

In [24]:
# List Sorting
start_short = timeit.default_timer()
sorted(short_list)
stop_short = timeit.default_timer()

start_large = timeit.default_timer()
sorted(large_list)
stop_large = timeit.default_timer()

print('Running time for 100 elements (sec): {0:.7f}'.format(stop_short - start_short))
print('Running time for 10^6 elements (sec): {0:.7f}'.format(stop_large - start_large))

Running time for 100 elements (sec): 0.0001359
Running time for 10^6 elements (sec): 0.4794040


In [25]:
# Series Sorting
start_short = timeit.default_timer()
short_series.sort_values()
stop_short = timeit.default_timer()

start_large = timeit.default_timer()
large_series.sort_values()
stop_large = timeit.default_timer()

print('Running time for 100 elements (sec): {0:.7f}'.format(stop_short - start_short))
print('Running time for 10^6 elements (sec): {0:.7f}'.format(stop_large - start_large))

Running time for 100 elements (sec): 0.0016501
Running time for 10^6 elements (sec): 0.1261170
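
A single start/stop measurement like the ones above can be noisy. A slightly more robust sketch, under the assumption that the best of a few repeated runs is a fair summary, uses `timeit.repeat` and separates a real Python list from the NumPy array it was built from. The helper `best_of` and the variable names are hypothetical, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Same kind of data as above, in three containers
rng = np.random.default_rng(seed=0)
values_array = rng.integers(0, 100, size=100_000)   # NumPy array
values_list = values_array.tolist()                 # plain Python list
values_series = pd.Series(values_array)             # pandas Series


def best_of(func, repeat=5):
    """Run func `repeat` times and return the fastest run in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


timings = {
    'list, sorted()': best_of(lambda: sorted(values_list)),
    'ndarray, np.sort()': best_of(lambda: np.sort(values_array)),
    'Series, sort_values()': best_of(lambda: values_series.sort_values()),
}

for name, seconds in timings.items():
    print('{}: {:.7f} sec'.format(name, seconds))
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to run during one of the repetitions.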


### Pandas: From lists to series - Statistics

**Calculate mean: Series vs List**

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/4e3313161244f8ab61d897fb6e5fbf6647e1d5f5)

In [39]:
# List
start_large = timeit.default_timer()

print(float(sum(large_list)) / len(large_list))

stop_large = timeit.default_timer()
print('Running time for 10^6 elements (sec): {0:.7f}'.format(stop_large - start_large))

49.520771
Running time for 10^6 elements (sec): 0.0872190


In [38]:
# Series
start_large = timeit.default_timer()

print(large_series.mean())

stop_large = timeit.default_timer()
print('Running time for 10^6 elements (sec): {0:.7f}'.format(stop_large - start_large))

49.520771
Running time for 10^6 elements (sec): 0.0140061


### Pandas: From dictionaries to Data Frames

![Data Structures](http://pbpython.com/images/pandas-dataframe-shadow.png)

#### *When to use what?*
*   "Small" Data: < 0.5 TB
      *    Iterative processes: all data structures are fine; lists and dictionaries are the usual choice
      *    Analysis of big chunks of data: use Pandas data frames
      
*   Big Data: > 0.5 TB
      *    Real-time processing of streaming data: use generators (over a list of dictionaries)
      *    Data that is not streamed in real time: processing will require RDDs or Spark data frames  
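
To make the generator idea from the list above more concrete, here is a minimal sketch. The function `stream_records` is a hypothetical stand-in for a real event source, and the records are the first five rows of the copy shop data. The point is that each record is handled as it arrives, so the full data set never has to sit in memory at once.

```python
from collections import defaultdict


def stream_records():
    """Hypothetical event source: yields one record (a dict) at a time."""
    sample = [
        ('monday', 43),
        ('thursday', 7),
        ('saturday', 23),
        ('sunday', 11),
        ('monday', 0),
    ]
    for day, euro in sample:
        yield {'day': day, 'income_euro': euro}


# Keep only a running total per day instead of storing every record
income_per_day = defaultdict(int)
n_records = 0

for record in stream_records():
    income_per_day[record['day']] += record['income_euro']
    n_records += 1

print(n_records)
print(dict(income_per_day))
```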
   

In [0]:
# Exercise 4. Build a dictionary with the lists from the copy shop data

In [0]:
# Exercise 5. Build a list of dictionaries with the lists from the copy shop data

In [0]:
# Exercise 6. Build a data frame with the lists from the copy shop data

### Pandas: Data Types
[Using pandas with large data](https://www.dataquest.io/blog/pandas-big-data/)

**Data Types (figures reported in the linked article for its example data set):**
*   Average memory usage for float columns: 1.29 MB
*   Average memory usage for int columns: 1.12 MB
*   Average memory usage for object columns: 9.53 MB
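
The gap between object columns and numeric columns can be reproduced on a small scale. The sketch below uses a synthetic data frame (the column names, sizes and random values are assumptions, not the data set from the article) and compares memory use before and after converting the string column to `category` and downcasting the integer column.

```python
import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng(seed=0)

# Synthetic example data: one int, one float and one string (object) column
df = pd.DataFrame({
    'visits': rng.integers(0, 10, size=n),
    'income_euro': rng.random(n) * 100,
    'day': pd.Series(rng.choice(['monday', 'thursday', 'saturday'], size=n)).astype(object),
})

# deep=True also counts the Python string objects behind an object column
memory_before = df.memory_usage(deep=True)
print(memory_before)

# Few distinct strings: category stores each string once plus small integer codes
day_as_category = df['day'].astype('category')

# Small integers: downcast picks the smallest integer type that fits the values
visits_downcast = pd.to_numeric(df['visits'], downcast='integer')

print(day_as_category.memory_usage(deep=True))
print(visits_downcast.memory_usage(deep=True))
```

The conversion to `category` pays off when a column has few distinct values compared to its length. For a column where almost every value is unique it can even use more memory.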

In [0]:
# Exercise 7. See data types in the columns of the data frame with the copy shop data set