### Numpy

    1.NumPy stands for ‘Numerical Python’ (also written ‘Numeric Python’).
    2.It provides fast mathematical computation on arrays and matrices.
    3.A NumPy array can also be used as an efficient multi-dimensional container for generic data.
    
#### Array
    1.NumPy’s main object is the homogeneous multidimensional array (ndarray).
    2.In NumPy, dimensions are called axes.
    3.The number of axes is called the rank (exposed as the ndim attribute).
    4.Arrays can be created in several ways, for example from Python lists or with helper functions such as np.zeros.
#### Important attributes of a NumPy array
    1.ndim: the number of dimensions (axes) of the array
    2.shape: a tuple of integers giving the size of the array along each dimension
    3.size: the total number of elements in the array
    4.dtype: the type of the elements in the array, e.g. int64 or float64
    5.itemsize: the size in bytes of each element
    6.reshape: a method (not an attribute) that returns the array with a new shape
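The attributes listed above can be inspected on a small array. This is a minimal sketch; the array `arr` and its explicit `dtype=np.int64` are assumptions chosen so that `itemsize` is predictable.

```python
import numpy as np

# A 2x3 array with an explicit dtype so that itemsize is predictable
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(arr.ndim)      # 2 (number of axes)
print(arr.shape)     # (2, 3)
print(arr.size)      # 6
print(arr.dtype)     # int64
print(arr.itemsize)  # 8 (bytes per element)

# reshape is a method: it returns the same data arranged in a new shape
reshaped = arr.reshape(3, 2)
print(reshaped.shape)  # (3, 2)
```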

**Features of Arrays**
* If you assign a single value to an ndarray slice, the value is copied across the whole slice.
* ndarray slices are views on the same data buffer, so modifying a slice also modifies the original ndarray.
* Multidimensional NumPy arrays are indexed with a single comma-separated index such as a[0, 1], whereas nested Python lists need chained indexing such as a[0][1].
* An important feature of NumPy arrays is broadcasting: arithmetic between arrays of different but compatible shapes stretches the smaller array to match the larger one.
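The features in this list can be seen in a few lines. The sketch below uses illustrative arrays (`arr`, `matrix`, `row`) that are not part of the original notebook.

```python
import numpy as np

arr = np.arange(10)          # [0 1 2 3 4 5 6 7 8 9]

# Assigning a scalar to a slice copies it across the whole slice
arr[5:8] = 12
print(arr)                   # [ 0  1  2  3  4 12 12 12  8  9]

# A slice is a view: writing through it changes the original array
view = arr[5:8]
view[1] = 99
print(arr)                   # [ 0  1  2  3  4 12 99 12  8  9]

# Use copy() when an independent array is needed
independent = arr[5:8].copy()
independent[:] = 0
print(arr[5:8])              # [12 99 12]

# Broadcasting: the 1-D row is stretched across each row of the matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
print(matrix + row)          # [[11 22 33]
                             #  [14 25 36]]
```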

NumPy also provides many functions to create arrays:

In [1]:
import numpy as np

a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[0. 0.]
                      #          [0. 0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[1. 1.]]"

c = np.full((2,2), 7)  # Create a constant array (integer dtype, taken from the fill value)
print(c)               # Prints "[[7 7]
                       #          [7 7]]"

d = np.eye(2)         # Create a 2x2 identity matrix
print(d)              # Prints "[[1. 0.]
                      #          [0. 1.]]"

e = np.random.random((2,2))  # Create an array filled with random values
print(e)                     # Might print "[[ 0.91940167  0.08143941]
                             #               [ 0.68744134  0.87236687]]"

[[0. 0.]
 [0. 0.]]
[[1. 1.]]
[[7 7]
 [7 7]]
[[1. 0.]
 [0. 1.]]
[[0.47779658 0.36837245]
 [0.0910966  0.07046796]]


#### Array indexing

#### 1.Slicing:

In [None]:
import numpy as np

# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"

#### 2.Integer array indexing

Integer array indexing allows you to construct arbitrary arrays using the data from another array. Here is an example:

In [3]:
import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

# An example of integer array indexing.
# The returned array will have shape (3,).
print(a[[0, 1, 2], [0, 1, 0]])  # Prints "[1 4 5]"

# The above example of integer array indexing is equivalent to this:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))  # Prints "[1 4 5]"

# When using integer array indexing, you can reuse the same
# element from the source array:
print(a[[0, 0], [1, 1]])  # Prints "[2 2]"

# Equivalent to the previous integer array indexing example
print(np.array([a[0, 1], a[0, 1]]))  # Prints "[2 2]"


[1 4 5]
[1 4 5]
[2 2]
[2 2]


#### Fancy Indexing

In [None]:
import numpy as np

# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

print(a)  # Prints "[[ 1  2  3]
          #          [ 4  5  6]
          #          [ 7  8  9]
          #          [10 11 12]]"

# Create an array of indices
b = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b
print(a[np.arange(4), b])  # Prints "[ 1  6  7 11]"

# Mutate one element from each row of a using the indices in b
a[np.arange(4), b] += 10

print(a)  # Prints "[[11  2  3]
          #          [ 4  5 16]
          #          [17  8  9]
          #          [10 21 12]]"

#### Boolean array indexing:

In [None]:
import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)   # Find the elements of a that are bigger than 2;
                     # this returns a numpy array of Booleans of the same
                     # shape as a, where each slot of bool_idx tells
                     # whether that element of a is > 2.

print(bool_idx)      # Prints "[[False False]
                     #          [ True  True]
                     #          [ True  True]]"

# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print(a[bool_idx])  # Prints "[3 4 5 6]"

# We can do all of the above in a single concise statement:
print(a[a > 2])     # Prints "[3 4 5 6]"


### Array Operations

In [None]:
import numpy as np

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))


#### Vertical & Horizontal Stacking
To concatenate two arrays rather than add them elementwise, use vertical stacking (np.vstack), which places the arrays on top of each other, or horizontal stacking (np.hstack), which places them side by side.

In [38]:
import numpy as np
x= np.array([(1,2,3),(3,4,5)])
y= np.array([(1,2,3),(3,4,5)])
print(np.vstack((x,y)))
print(np.hstack((x,y)))

[[1 2 3]
 [3 4 5]
 [1 2 3]
 [3 4 5]]
[[1 2 3 1 2 3]
 [3 4 5 3 4 5]]


#### Sum

In [4]:
a= np.array([(1,2,3),(3,4,5)])
print(a.sum(axis=0))
# Output - [4 6 8]
# axis=0 sums down each column: 1+3=4, 2+4=6 and 3+5=8. With axis=1 each row is summed instead, which prints [6 12].

[4 6 8]


In [5]:
a

array([[1, 2, 3],
       [3, 4, 5]])

#### ravel

ravel() flattens a NumPy array into a one-dimensional array.

In [6]:
import numpy as np
x= np.array([(1,2,3),(3,4,5)])
print(x.ravel())

[1 2 3 3 4 5]


#### numpy.argmax() and numpy.argmin()

In [46]:
import numpy as np 
a = np.array([[30,40,70],[80,20,10],[50,90,60]]) 

In [47]:
print('Applying argmax() function:' )
print(np.argmax(a))

Applying argmax() function:
7


In [53]:
print('Applying argmin() function:' )
print(np.argmin(a))

Applying argmin() function:
5


In [54]:
print('Array containing indices of maximum along axis 0:' )
np.argmax(a, axis = 0) 

In [55]:
print ('Array containing indices of maximum along axis 1:') 
maxindex = np.argmax(a, axis = 1) 

array([[30, 40, 70],
       [80, 20, 10],
       [50, 90, 60]])

In [56]:
maxindex

array([2, 0, 1], dtype=int64)

#### References

* https://www.edureka.co/blog/python-numpy-tutorial/

### Pandas

In [None]:
Theory
Data Frame Creations
Data Filtering
Slicing
DateTime handling
str related operations
Joins/Concatenations

#### Introduction
    
    1.Pandas is used for data manipulation, analysis and cleaning.
    2.It helps you manage two-dimensional data tables in Python.
    3.Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
    4.Pandas enables you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.
    5.Pandas is well suited to different kinds of data, such as tabular data and time series.
    6.Pandas makes string manipulation easier.
    
#### What can be achieved using Pandas?
* Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format
* Flexible reshaping and pivoting of data sets
* High-performance merging and joining of data sets
* Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data
* Highly optimized for performance, with critical code paths written in Cython or C

#### Data Structures in Pandas
    1.Data Frames
    2.Series

#### 1.Series

    1.A pandas Series is a one-dimensional data structure (“a one-dimensional ndarray”) that stores values, each paired with an index label. The default labels 0, 1, 2, ... are unique, but custom labels are not required to be.
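A short sketch of a Series with explicit index labels follows. The values and the labels `'a'`, `'b'`, `'c'` are illustrative assumptions.

```python
import pandas as pd

# A Series with explicit index labels (values and labels are illustrative)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s['b'])             # 20, lookup by label
print(s.iloc[0])          # 10, lookup by position
print(s.index.tolist())   # ['a', 'b', 'c']
```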


#### 2. Data Frames
    1.A pandas DataFrame is a two-dimensional data structure: basically a table with rows and columns.
    2.The columns have names and the rows have index labels.


#### Data Frame Creation
1. Using Nested List

In [None]:
# Import pandas library 
import pandas as pd 

# initialize list of lists 
data = [['tom', 10], ['nick', 15], ['juli', 14]] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Name', 'Age']) 

# print dataframe. 
df 


2. From Dictionaries

In [33]:
# Create a DataFrame from a dict of lists.
# Each key becomes a column name; rows get a default integer index.

import pandas as pd 

# Initialise the data as a dict of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]} 

# Create DataFrame 
df = pd.DataFrame(data) 

# Print the output. 
df 


    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18


3. Creating an indexed DataFrame from a dict of lists

In [34]:
# Create a pandas DataFrame from a dict of lists
# with explicit row index labels.
import pandas as pd 

# Initialise the data as a dict of lists.
data = {'Name':['Tom', 'Jack', 'nick', 'juli'], 'marks':[99, 98, 95, 90]} 

# Creates pandas DataFrame. 
df = pd.DataFrame(data, index =['rank1', 'rank2', 'rank3', 'rank4']) 

# print the data 
df 

       Name  marks
rank1   Tom     99
rank2  Jack     98
rank3  nick     95
rank4  juli     90


4. Creating a DataFrame from a list of dicts

In [35]:
# Create a pandas DataFrame from a list of dicts.
# Each dict becomes one row; its keys become column names.
import pandas as pd 

# Initialise the data as a list of dicts.
data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}] 

# Creates DataFrame. 
df = pd.DataFrame(data) 

# Print the data 
df 


    a   b   c
0   1   2   3
1  10  20  30


5. Creating DataFrame using zip() function.

In [36]:
# Create a pandas DataFrame from two lists
# by pairing their elements with zip().
	
import pandas as pd 
	
# List1 
Name = ['tom', 'krish', 'nick', 'juli'] 
	
# List2 
Age = [25, 30, 26, 22] 
	
# get the list of tuples from two lists. 
# and merge them by using zip(). 
list_of_tuples = list(zip(Name, Age)) 
	
# Inspect the list of tuples.
list_of_tuples 


# Converting lists of tuples into 
# pandas Dataframe. 
df = pd.DataFrame(list_of_tuples, columns = ['Name', 'Age']) 
	
# Print data. 
df 


    Name  Age
0    tom   25
1  krish   30
2   nick   26
3   juli   22


6. Creating a DataFrame from a dict of Series

In [37]:
# Create a pandas DataFrame from a dict of Series.
# Each key becomes a column name.

import pandas as pd 

# Initialise the data as a dict of Series.
d = {'one' : pd.Series([10, 20, 30, 40]), 
	'two' : pd.Series([10, 20, 30, 40])} 

# creates Dataframe. 
df = pd.DataFrame(d) 

# print the data. 
df 


   one  two
0   10   10
1   20   20
2   30   30
3   40   40


#### Data Filtering

In [32]:
df[df.one>=30]

   one  two
2   30   30
3   40   40


#### Indexing

##### .loc
* .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.
* Inputs can be
        * A single label
        * A list or array of labels ['a', 'b', 'c']
        * A slice object with labels 'a':'f' (unlike Python slices, both the start and the stop are included)
        * A boolean array
        
        
##### .iloc
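* .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
* Inputs can be
        * An integer
        * A list or array of integers [4, 3, 0]
        * A slice object with integers 1:7 (the stop is excluded, as in Python slices)
        * A boolean array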
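The difference between label-based and position-based selection can be sketched as follows. The frame reuses the `rank1`..`rank4` example from the DataFrame creation section; the specific selections are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Tom', 'Jack', 'nick', 'juli'], 'marks': [99, 98, 95, 90]},
    index=['rank1', 'rank2', 'rank3', 'rank4'],
)

# .loc selects by label; label slices include both endpoints
print(df.loc['rank2', 'Name'])                      # Jack
print(df.loc['rank1':'rank3', 'marks'].tolist())    # [99, 98, 95]

# .iloc selects by integer position; the end of a slice is excluded
print(df.iloc[1, 0])                                # Jack
print(df.iloc[0:3, 1].tolist())                     # [99, 98, 95]

# A boolean mask selects the rows where the condition holds
print(df.loc[df['marks'] > 96, 'Name'].tolist())    # ['Tom', 'Jack']
```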

#### Date Functionalities

Date functionality plays a major role in time series analysis and financial data analysis.
Typical tasks include:
* Converting a string column to a DateTime column
* Generating a sequence of dates
* Converting a date series to a different frequency

#### Parsing Date Time Column

* While reading a file, you can pass the parameter parse_dates=["Date Column"] to pd.read_csv
* pd.to_datetime can be used to convert an existing column
* https://www.datacamp.com/community/tutorials/converting-strings-datetime-objects
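Both parsing options can be sketched on a tiny in-memory CSV. The column names `Date` and `Price` and the sample rows are hypothetical; `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = "Date,Price\n2011-01-03,10.5\n2011-01-04,11.0\n"

# Option 1: parse the column while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
print(pd.api.types.is_datetime64_any_dtype(df['Date']))   # True

# Option 2: convert an existing string column afterwards
raw = pd.read_csv(io.StringIO(csv_text))
raw['Date'] = pd.to_datetime(raw['Date'], format='%Y-%m-%d')
print(raw['Date'].dt.year.tolist())                       # [2011, 2011]
```

Passing an explicit `format` avoids ambiguity between day-first and month-first strings.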

#### Date Generation

In [13]:
import pandas as pd

print(pd.date_range('1/1/2011', periods=5))

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05'],
              dtype='datetime64[ns]', freq='D')


bdate_range
bdate_range() stands for business date ranges. Unlike date_range(), it excludes Saturday and Sunday.

In [15]:
import pandas as pd

pd.bdate_range('1/1/2011', periods=5)

DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07'],
              dtype='datetime64[ns]', freq='B')


pd.date_range(start_date, end_date, freq='W')
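The call above uses placeholder arguments. A concrete sketch, with assumed start and end dates in January 2011, shows weekly date generation and a daily-to-weekly frequency conversion with `resample`:

```python
import pandas as pd

# Weekly frequency: 'W' anchors each date on a Sunday by default
weekly = pd.date_range('2011-01-01', '2011-01-31', freq='W')
print(weekly.strftime('%Y-%m-%d').tolist())
# ['2011-01-02', '2011-01-09', '2011-01-16', '2011-01-23', '2011-01-30']

# Convert a daily series (Monday 3 Jan to Sunday 16 Jan) to weekly totals
daily = pd.Series(1, index=pd.date_range('2011-01-03', periods=14))
weekly_totals = daily.resample('W').sum()
print(weekly_totals.tolist())   # [7, 7]
```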

#### Adding and subtracting time

* In Python, the timedelta object from the datetime module represents the difference between two datetime objects; pandas offers the equivalent pd.Timedelta.
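A minimal sketch of adding and subtracting time, first with the standard library and then with the pandas equivalent. The dates are illustrative.

```python
from datetime import datetime, timedelta
import pandas as pd

start = datetime(2011, 1, 1)
print(start + timedelta(days=7))        # 2011-01-08 00:00:00
print(datetime(2011, 1, 8) - start)     # 7 days, 0:00:00

# pandas equivalent: shift a whole DatetimeIndex at once
dates = pd.date_range('2011-01-01', periods=3)
shifted = dates + pd.Timedelta(days=1)
print(shifted.strftime('%Y-%m-%d').tolist())
# ['2011-01-02', '2011-01-03', '2011-01-04']
```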

#### String Manipulation using pandas

* https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
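String methods are available on a Series through the `.str` accessor. A short sketch with made-up names:

```python
import pandas as pd

# Illustrative names with inconsistent case and stray whitespace
names = pd.Series(['tom', 'Nick ', ' juli'])

cleaned = names.str.strip().str.title()
print(cleaned.tolist())                      # ['Tom', 'Nick', 'Juli']
print(cleaned.str.contains('i').tolist())    # [False, True, True]
print(cleaned.str.len().tolist())            # [3, 4, 4]
```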

#### Merging & Joining

* pd.concat() function: the most general-purpose option; combines multiple DataFrames along either axis.
* DataFrame.append() method: a quick way to add rows to a DataFrame, but not applicable for adding columns. It was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat() instead.
* pd.merge() function: joins two DataFrames on one or more key columns containing common values.
* DataFrame.join() method: a convenient way to join two DataFrames; by default it matches on index labels rather than columns.
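As a sketch of merging on a single key column, the example below uses two hypothetical tables (`rates` and `gdp`) that share a `Year` column. The `how` argument controls which keys are kept.

```python
import pandas as pd

# Hypothetical tables sharing a "Year" key column
rates = pd.DataFrame({'Year': [2001, 2002, 2003], 'Int_Rate': [2, 1, 2]})
gdp = pd.DataFrame({'Year': [2002, 2003, 2004], 'IND_GDP': [45, 45, 67]})

inner = pd.merge(rates, gdp, on='Year')               # keys present in both
outer = pd.merge(rates, gdp, on='Year', how='outer')  # keys present in either

print(inner['Year'].tolist())   # [2002, 2003]
print(len(outer))               # 4
```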

In [2]:
import pandas as pd
 
df1= pd.DataFrame({ "HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])
 
df2=pd.DataFrame({ "HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])
 
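# With no `on` argument, merge joins on all common columns
# and replaces the original index with a fresh integer index.
merged= pd.merge(df1,df2)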
 
print(merged)

   HPI  Int_Rate  IND_GDP
0   80         2       50
1   90         1       45
2   70         2       45
3   60         3       67


#### Join

If on=None, the join key will be the row index

In [21]:
df1 = pd.DataFrame({"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])
 
df2 = pd.DataFrame({"Low_Tier_HPI":[50,45,67,34],"Unemployment":[1,3,5,6]}, index=[2001, 2003,2004,2004])

print(df1)
print(df2)
joined= df1.join(df2)
print(joined)

      Int_Rate  IND_GDP
2001         2       50
2002         1       45
2003         2       45
2004         3       67
      Low_Tier_HPI  Unemployment
2001            50             1
2003            45             3
2004            67             5
2004            34             6
      Int_Rate  IND_GDP  Low_Tier_HPI  Unemployment
2001         2       50          50.0           1.0
2002         1       45           NaN           NaN
2003         2       45          45.0           3.0
2004         3       67          67.0           5.0
2004         3       67          34.0           6.0


**Observation:**<br>
    The dtype of the joined columns changes to float64 because the missing value NaN cannot be represented in an integer column.

#### Concat

In [4]:
df1 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])
 
df2 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])
 
concat= pd.concat([df1,df2])
 
print(concat)

      HPI  Int_Rate  IND_GDP
2001   80         2       50
2002   90         1       45
2003   70         2       45
2004   60         3       67
2005   80         2       50
2006   90         1       45
2007   70         2       45
2008   60         3       67


Note: specify axis=1 in pd.concat() to concatenate along the columns instead of the rows.

In [None]:
df1 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3], "IND_GDP":[50,45,45,67]}, index=[2001, 2002,2003,2004])
 
df2 = pd.DataFrame({"HPI":[80,90,70,60],"Int_Rate":[2,1,2,3],"IND_GDP":[50,45,45,67]}, index=[2005, 2006,2007,2008])
 
concat= pd.concat([df1,df2],axis=1)
 
print(concat)

## Summary

### Importing Data
* pd.read_csv(filename) | From a CSV file
* pd.read_table(filename) | From a delimited text file (like TSV)
* pd.read_excel(filename) | From an Excel file
* pd.read_sql(query, connection_object) | Read from a SQL table/database
* pd.read_json(json_string) | Read from a JSON formatted string, URL or file.
* pd.read_html(url) | Parses an html URL, string or file and extracts tables to a list of dataframes
* pd.read_clipboard() | Takes the contents of your clipboard and passes it to read_table()
* pd.DataFrame(dict) | From a dict, keys for columns names, values for data as lists

### Exporting Data
* df.to_csv(filename) | Write to a CSV file
* df.to_excel(filename) | Write to an Excel file
* df.to_sql(table_name, connection_object) | Write to a SQL table
* df.to_json(filename) | Write to a file in JSON format

### Create Test Objects
Useful for testing code segments.

* pd.DataFrame(np.random.rand(20,5)) | 5 columns and 20 rows of random floats
* pd.Series(my_list) | Create a series from an iterable my_list
* df.index = pd.date_range('1900/1/30', periods=df.shape[0]) | Add a date index

### Viewing/Inspecting Data
* df.head(n) | First n rows of the DataFrame
* df.tail(n) | Last n rows of the DataFrame
* df.shape | Number of rows and columns
* df.info() | Index, Datatype and Memory information
* df.describe() | Summary statistics for numerical columns
* s.value_counts(dropna=False) | View unique values and counts
* df.apply(pd.Series.value_counts) | Unique values and counts for all columns

### Selection
* df[col] | Returns column with label col as Series
* df[[col1, col2]] | Returns columns as a new DataFrame
* s.iloc[0] | Selection by position
* s.loc['index_one'] | Selection by index
* df.iloc[0,:] | First row
* df.iloc[0,0] | First element of first column

### Data Cleaning
* df.columns = ['a','b','c'] | Rename columns
* pd.isnull(obj) | Checks for null values, returns a Boolean array
* pd.notnull(obj) | Opposite of pd.isnull()
* df.dropna() | Drop all rows that contain null values
* df.dropna(axis=1) | Drop all columns that contain null values
* df.dropna(axis=1,thresh=n) | Drop all columns that have fewer than n non-null values
* df.fillna(x) | Replace all null values with x
* s.fillna(s.mean()) | Replace all null values with the mean (mean can be replaced with almost any function from the Statistics section)
* s.astype(float) | Convert the datatype of the series to float
* s.replace(1,'one') | Replace all values equal to 1 with 'one'
* s.replace([1,3],['one','three']) | Replace all 1 with 'one' and 3 with 'three'
* df.rename(columns=lambda x: x + 1) | Mass renaming of columns
* df.rename(columns={'old_name': 'new_name'}) | Selective renaming
* df.set_index('column_one') | Change the index
* df.rename(index=lambda x: x + 1) | Mass renaming of index
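A few of the cleaning calls above, sketched on a small frame with missing values. The frame and its column names `a` and `b` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})

# Count the nulls in each column
print(df.isnull().sum().to_dict())                    # {'a': 1, 'b': 2}

# Only the last row has no nulls
print(df.dropna().shape)                              # (1, 2)

# Keep the columns that have at least 2 non-null values
print(df.dropna(axis=1, thresh=2).columns.tolist())   # ['a']

# Fill the nulls in a column with that column's mean
filled = df['a'].fillna(df['a'].mean())
print(filled.tolist())                                # [1.0, 2.0, 3.0]
```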

### Filter, Sort, and Groupby
* df[df[col] > 0.5] | Rows where the column col is greater than 0.5
* df[(df[col] > 0.5) & (df[col] < 0.7)] | Rows where 0.7 > col > 0.5
* df.sort_values(col1) | Sort values by col1 in ascending order
* df.sort_values(col2,ascending=False) | Sort values by col2 in descending order
* df.sort_values([col1,col2],ascending=[True,False]) | Sort values by col1 in ascending order then col2 in descending order
* df.groupby(col) | Returns a groupby object for values from one column
* df.groupby([col1,col2]) | Returns groupby object for values from multiple columns
* df.groupby(col1)[col2].mean() | Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the Statistics section)
* df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean') | Create a pivot table that groups by col1 and calculates the mean of col2 and col3
* df.groupby(col1).agg(np.mean) | Find the average across all columns for every unique col1 group
* df.apply(np.mean) | Apply the function np.mean() across each column
* df.apply(np.max,axis=1) | Apply the function np.max() across each row
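Sorting, grouping and pivoting can be sketched on a small frame. The column names `team` and `score` and the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b'],
    'score': [1.0, 3.0, 10.0, 20.0],
})

# Mean score per team
means = df.groupby('team')['score'].mean()
print(means.to_dict())              # {'a': 2.0, 'b': 15.0}

# The same aggregation expressed as a pivot table
pivot = df.pivot_table(index='team', values='score', aggfunc='mean')
print(pivot['score'].tolist())      # [2.0, 15.0]

# Sort by score in descending order
ranked = df.sort_values('score', ascending=False)
print(ranked['score'].tolist())     # [20.0, 10.0, 3.0, 1.0]
```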


### Join/Combine
* pd.concat([df1, df2]) | Add the rows of df2 to the end of df1 (columns should be identical); this replaces df1.append(df2), which was removed in pandas 2.0
* pd.concat([df1, df2],axis=1) | Add the columns of df2 to the end of df1 (rows should be identical)
* df1.join(df2,on=col1,how='inner') | SQL-style join of the columns in df1 with the columns in df2, matching the values of col1 in df1 against the index of df2. 'how' can be one of 'left', 'right', 'outer', 'inner'

### Statistics
These can all be applied to a series as well.
* df.describe() | Summary statistics for numerical columns
* df.mean() | Returns the mean of all columns
* df.corr() | Returns the correlation between columns in a DataFrame
* df.count() | Returns the number of non-null values in each DataFrame column

### References
* https://pandas.pydata.org/
* https://martin-thoma.com/pandas-merge-join-concatenate/
* https://towardsdatascience.com/collecting-data-science-cheat-sheets-d2cdff092855
* https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/
* https://riptutorial.com/pandas/example/23978/what-is-the-difference-between-join-and-merge
* https://leportella.com/cheatlist/2017/11/22/pandas-cheat-list.html