## NumPy Tutorial

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. <br><br>
NumPy is open-source software and has many contributors. NumPy brings the computational power of languages like C and Fortran to Python, a language much easier to learn and use.
<br><br>
**Arrays** <br>
An array is a collection of homogeneous elements, i.e. all elements share the same data type. The elements are typically stored in one contiguous block of memory, which makes indexing and element-wise operations fast. The size of a NumPy array is fixed at creation, so adding or removing elements produces a new array. Plain Python offers lists and the `array` module; in this tutorial, arrays are created with NumPy's `np.array`.
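
To make the "same data type" point concrete, here is a minimal sketch (the variable names `mixed` and `ints` are illustrative and not part of the original notebook). It shows that NumPy picks one common dtype for the whole array and casts values to that dtype on assignment.

```python
import numpy as np

# Mixed int/float input is upcast to a single common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64

# An integer array keeps its integer dtype: a float assigned
# into it is truncated rather than changing the array's type.
ints = np.array([1, 2, 3])
ints[0] = 7.9
print(ints)          # [7 2 3]
```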


In [1]:
# Initializing numpy

import numpy as np

In [2]:
lst = [1,2,3,4,5,6]
arr = np.array(lst)

print( type(arr) )
print(arr)

<class 'numpy.ndarray'>
[1 2 3 4 5 6]


In [3]:
arr.shape

(6,)

In [4]:
arr1 = [1,2,3,4,5]
arr2 = [8,3,4,6,7]
arr3 = [6,7,8,2,3]

arr = np.array([arr1,arr2,arr3])

In [5]:
print(arr.shape)
print(type(arr) )

(3, 5)
<class 'numpy.ndarray'>


In [6]:
arr.reshape(5,3)

array([[1, 2, 3],
       [4, 5, 8],
       [3, 4, 6],
       [7, 6, 7],
       [8, 2, 3]])
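
A few properties of `reshape` are easy to miss in the cell above. The sketch below uses a stand-in array built with `np.arange` (an assumption, not the notebook's data). It shows that `reshape` leaves the original array's shape untouched, that `-1` lets NumPy infer one dimension, and that an incompatible shape raises `ValueError`.

```python
import numpy as np

arr = np.arange(15).reshape(3, 5)

# reshape returns a new array object; the original keeps its shape.
reshaped = arr.reshape(5, 3)
print(arr.shape, reshaped.shape)   # (3, 5) (5, 3)

# -1 asks NumPy to infer that dimension from the total element count.
inferred = arr.reshape(-1, 3)
print(inferred.shape)              # (5, 3)

# A shape whose product differs from arr.size raises ValueError.
try:
    arr.reshape(4, 4)
except ValueError as exc:
    print("cannot reshape:", exc)
```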

In [7]:
# Slicing - [ rowStart:rowEnd, colStart:colEnd ]

arr[:,1:4]

array([[2, 3, 4],
       [3, 4, 6],
       [7, 8, 2]])

In [8]:
# arange(start, stop, step) - like range, the stop value is excluded

newArr = np.arange(10,20,2)
print(newArr)

[10 12 14 16 18]


In [9]:
# linspace(start, stop, num) - num evenly spaced values with stop included; handy for plotting

np.linspace(1,10,50)

array([ 1.        ,  1.18367347,  1.36734694,  1.55102041,  1.73469388,
        1.91836735,  2.10204082,  2.28571429,  2.46938776,  2.65306122,
        2.83673469,  3.02040816,  3.20408163,  3.3877551 ,  3.57142857,
        3.75510204,  3.93877551,  4.12244898,  4.30612245,  4.48979592,
        4.67346939,  4.85714286,  5.04081633,  5.2244898 ,  5.40816327,
        5.59183673,  5.7755102 ,  5.95918367,  6.14285714,  6.32653061,
        6.51020408,  6.69387755,  6.87755102,  7.06122449,  7.24489796,
        7.42857143,  7.6122449 ,  7.79591837,  7.97959184,  8.16326531,
        8.34693878,  8.53061224,  8.71428571,  8.89795918,  9.08163265,
        9.26530612,  9.44897959,  9.63265306,  9.81632653, 10.        ])
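
The difference between `arange` and `linspace` is easiest to see on a small range. This is a minimal sketch with made-up endpoints: `arange` takes a step and excludes the stop value, while `linspace` takes a count and includes the stop value by default. `retstep=True` also returns the spacing that `linspace` computed.

```python
import numpy as np

# arange: fixed step, stop value excluded.
stepped = np.arange(0, 1, 0.25)
print(stepped)        # [0.   0.25 0.5  0.75]

# linspace: fixed number of points, stop value included.
pts, step = np.linspace(0, 1, 5, retstep=True)
print(pts)            # [0.   0.25 0.5  0.75 1.  ]
print(step)           # 0.25
```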

In [10]:
# Assignment does not copy: both names refer to the same array, so beware
oldArr = np.arange(1,10,2)
newArr = oldArr

print(newArr)

[1 3 5 7 9]


In [11]:
newArr[2:]=10
print(newArr)
print(oldArr)

[ 1  3 10 10 10]
[ 1  3 10 10 10]


In [12]:
# Use copy() instead, which creates an independent copy of the data
oldArr = np.arange(1,10,2)
newArr = oldArr.copy()

newArr[2:]=10
print(newArr)
print(oldArr)

[ 1  3 10 10 10]
[1 3 5 7 9]
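
The same caveat applies to slices, not only to plain assignment. The sketch below (names such as `alias`, `view` and `independent` are illustrative) uses `np.shares_memory` to check which objects share a buffer with the original array.

```python
import numpy as np

old_arr = np.arange(1, 10, 2)

alias = old_arr               # same object, no data copied
view = old_arr[2:]            # basic slice: a view onto the same buffer
independent = old_arr.copy()  # separate buffer

print(alias is old_arr)                         # True
print(np.shares_memory(view, old_arr))          # True
print(np.shares_memory(independent, old_arr))   # False

view[:] = 10                  # writes through to old_arr
print(old_arr)                # [ 1  3 10 10 10]
print(independent)            # [1 3 5 7 9]
```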


In [13]:
np.ones( (3,6),dtype=int)

array([[1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1]])

In [14]:
np.random.rand(4,3)

array([[0.85629039, 0.96274459, 0.73397376],
       [0.88515599, 0.83505274, 0.62900968],
       [0.13913051, 0.43423738, 0.5293996 ],
       [0.61472507, 0.69107064, 0.65490074]])

In [15]:
np.random.randint(0,1000,15).reshape(5,3)

array([[471, 790, 117],
       [892, 895,  49],
       [826, 899, 942],
       [370, 924, 418],
       [767, 279, 297]])
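
The random outputs above change on every run. If reproducible values are needed, one option is NumPy's `Generator` API with a fixed seed. The sketch below is an alternative to the `np.random.rand` / `np.random.randint` calls used in the notebook, and the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same stream.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

a = rng_a.integers(0, 1000, size=(5, 3))   # upper bound is exclusive
b = rng_b.integers(0, 1000, size=(5, 3))
print(np.array_equal(a, b))                # True

# Uniform floats in [0, 1), comparable to np.random.rand(4, 3).
uniform = rng_a.random((4, 3))
print(uniform.shape)                       # (4, 3)
```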

## Pandas Tutorial

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.<br><br>
Pandas is well suited for many different kinds of data:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


In [16]:
# Importing pandas

import pandas as pd
import numpy as np

**DataFrames** <br>
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. <br><br>Like Series, DataFrame accepts many different kinds of input:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
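
As a rough illustration of the input kinds listed above, the following sketch builds small DataFrames from a dict of lists, a 2-D ndarray and a Series. All column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Dict of lists: keys become column names, columns may differ in dtype.
from_dict = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.5, 2.0, 3.5]})
print(from_dict.dtypes)

# 2-D ndarray: column labels are supplied separately.
from_array = pd.DataFrame(np.zeros((2, 3)), columns=["x", "y", "z"])
print(from_array.shape)               # (2, 3)

# A named Series becomes a one-column DataFrame and keeps its index.
from_series = pd.DataFrame(pd.Series([10, 20], index=["r1", "r2"], name="value"))
print(from_series.columns.tolist())   # ['value']
```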


In [17]:
# Working with Dataframes

df = pd.DataFrame( np.arange(0,20).reshape(5,4), index=['Row1','Row2','Row3','Row4','Row5'], columns=['Column1','column2','Column3','Column4'] )

In [18]:
# To display the top rows
df.head()

| | Column1 | column2 | Column3 | Column4 |
|---|---|---|---|---|
| Row1 | 0 | 1 | 2 | 3 |
| Row2 | 4 | 5 | 6 | 7 |
| Row3 | 8 | 9 | 10 | 11 |
| Row4 | 12 | 13 | 14 | 15 |
| Row5 | 16 | 17 | 18 | 19 |


In [19]:
df.to_csv('Data1.csv')

**To access the elements**
1. `.loc`<br>
`.loc[]` is a label-based indexer: you pass the labels of the rows or columns you want to select. A label slice includes its end label, unlike `.iloc[]`. `.loc[]` also accepts a boolean array or boolean Series as a mask.
<br><br>
2. `.iloc`<br>
`.iloc[]` is an integer-position-based indexer: you pass integer positions to select specific rows or columns. A position slice excludes its end position, unlike `.loc[]`. `.iloc[]` accepts a boolean array, but not a boolean Series.
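
The slice-endpoint difference can be checked directly. This sketch rebuilds a frame with the same shape and labels as the notebook's `df` and selects the same block once by label and once by position. The mask threshold is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "column2", "Column3", "Column4"],
)

# Label slices include both endpoints; position slices exclude the end.
by_label = df.loc["Row2":"Row4", "column2":"Column3"]
by_position = df.iloc[1:4, 1:3]
print(by_label.equals(by_position))               # True

# Boolean masks: .loc takes the Series, .iloc takes a plain array.
mask = df["Column1"] > 8
print(df.loc[mask].index.tolist())                # ['Row4', 'Row5']
print(df.iloc[mask.to_numpy()].index.tolist())    # ['Row4', 'Row5']
```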


In [20]:
df.loc[['Row1','Row2','Row4']]

| | Column1 | column2 | Column3 | Column4 |
|---|---|---|---|---|
| Row1 | 0 | 1 | 2 | 3 |
| Row2 | 4 | 5 | 6 | 7 |
| Row4 | 12 | 13 | 14 | 15 |


In [21]:
df.loc[ (df.Column1<5) | (df.Column4>14) ]

| | Column1 | column2 | Column3 | Column4 |
|---|---|---|---|---|
| Row1 | 0 | 1 | 2 | 3 |
| Row2 | 4 | 5 | 6 | 7 |
| Row4 | 12 | 13 | 14 | 15 |
| Row5 | 16 | 17 | 18 | 19 |


In [22]:
df.iloc[1:4,1:]

| | column2 | Column3 | Column4 |
|---|---|---|---|
| Row2 | 5 | 6 | 7 |
| Row3 | 9 | 10 | 11 |
| Row4 | 13 | 14 | 15 |


In [23]:
# Converting the DF into array
df.iloc[1:,:].values

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

**Series**<br>
A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Series can be thought of as a single column of a spreadsheet.<br>
Labels need not be unique but must be a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
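
A short sketch of these points, using made-up labels and values: the index may contain duplicate labels, and the same Series can be accessed by label (`.loc` or `[]`) or by integer position (`.iloc`).

```python
import pandas as pd

# The label "a" appears twice; this is allowed.
s = pd.Series([10, 20, 30], index=["a", "b", "a"], name="score")

# Label-based access with a duplicated label returns every match.
print(s["a"].tolist())   # [10, 30]

# A unique label returns a scalar; position-based access uses .iloc.
print(s.loc["b"])        # 20
print(s.iloc[1])         # 20
```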

In [24]:
type(df.loc['Row1'])

pandas.core.series.Series

In [25]:
df = pd.read_csv('mercedesbenz.csv', index_col=0)

In [26]:
df.head()

| ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | X10 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.81 | k | v | at | a | d | u | j | o | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 88.53 | k | t | av | e | d | y | l | o | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 76.26 | az | w | n | c | d | x | j | x | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9 | 80.62 | az | t | n | f | d | x | l | e | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 78.02 | az | v | n | f | d | h | d | n | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [27]:
df['X0'].unique()

array(['k', 'az', 't', 'al', 'o', 'w', 'j', 'h', 's', 'n', 'ay', 'f', 'x',
       'y', 'aj', 'ak', 'am', 'z', 'q', 'at', 'ap', 'v', 'af', 'a', 'e',
       'ai', 'd', 'aq', 'c', 'aa', 'ba', 'as', 'i', 'r', 'b', 'ax', 'bc',
       'u', 'ad', 'au', 'm', 'l', 'aw', 'ao', 'ac', 'g', 'ab'],
      dtype=object)

In [28]:
arr = np.array( df['X0'].unique())
arr

array(['k', 'az', 't', 'al', 'o', 'w', 'j', 'h', 's', 'n', 'ay', 'f', 'x',
       'y', 'aj', 'ak', 'am', 'z', 'q', 'at', 'ap', 'v', 'af', 'a', 'e',
       'ai', 'd', 'aq', 'c', 'aa', 'ba', 'as', 'i', 'r', 'b', 'ax', 'bc',
       'u', 'ad', 'au', 'm', 'l', 'aw', 'ao', 'ac', 'g', 'ab'],
      dtype=object)

In [29]:
arr.size

47

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4209 entries, 0 to 8417
Columns: 377 entries, y to X385
dtypes: float64(1), int64(368), object(8)
memory usage: 12.1+ MB


In [31]:
df.describe()

| | y | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | ... | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 |
| mean | 100.669318 | 0.013305 | 0.0 | 0.075077 | 0.057971 | 0.42813 | 0.000475 | 0.002613 | 0.007603 | 0.00784 | ... | 0.318841 | 0.057258 | 0.314802 | 0.02067 | 0.009503 | 0.008078 | 0.007603 | 0.001663 | 0.000475 | 0.001426 |
| std | 12.679381 | 0.11459 | 0.0 | 0.263547 | 0.233716 | 0.494867 | 0.021796 | 0.051061 | 0.086872 | 0.088208 | ... | 0.466082 | 0.232363 | 0.464492 | 0.142294 | 0.097033 | 0.089524 | 0.086872 | 0.040752 | 0.021796 | 0.037734 |
| min | 72.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 90.82 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 99.15 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 109.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 265.32 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [32]:
df['X0'].value_counts()

z     360
ak    349
y     324
ay    313
t     306
x     300
o     269
f     227
n     195
w     182
j     181
az    175
aj    151
s     106
ap    103
h      75
d      73
al     67
v      36
af     35
m      34
ai     34
e      32
ba     27
at     25
a      21
ax     19
i      18
aq     18
am     18
u      17
l      16
aw     16
ad     14
k      11
b      11
au     11
as     10
r      10
bc      6
ao      4
c       3
aa      2
q       2
ab      1
g       1
ac      1
Name: X0, dtype: int64

**Read CSV data from a URL and save it as CSV**


In [33]:
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data',header=None)

In [34]:
data.head()

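| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |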


In [35]:
# Save the DataFrame as a CSV file

data.to_csv('wine.csv')
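
The cell above reads comma-separated data. For input that really is JSON, a minimal sketch could look like the following. The JSON string and its field names (`make`, `price`) are hypothetical, with values borrowed from the rows shown earlier. `pd.read_json` parses the records and `to_csv` serializes them.

```python
import io
import pandas as pd

# Hypothetical JSON input: a list of records with illustrative field names.
json_text = '[{"make": "audi", "price": 13950}, {"make": "audi", "price": 17450}]'

# Wrap the string in StringIO so pandas treats it as a file-like object.
records = pd.read_json(io.StringIO(json_text))

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = records.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv`, as in the cell above, writes the same content to disk instead.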

In [36]:
# Reading HTML Content

url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/banklist.html'
dfs = pd.read_html(url)

# Here read_html() returns the list of all tables present on the page.


In [37]:
# Display the first table in the list
dfs[0]

| | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date |
|---|---|---|---|---|---|---|
| 0 | Almena State Bank | Almena | KS | 15426 | Equity Bank | October 23, 2020 |
| 1 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | October 16, 2020 |
| 2 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | April 3, 2020 |
| 3 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | February 14, 2020 |
| 4 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | November 1, 2019 |
| ... | ... | ... | ... | ... | ... | ... |
| 558 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | July 27, 2001 |
| 559 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | May 3, 2001 |
| 560 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | February 2, 2001 |
| 561 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | December 14, 2000 |


In [38]:
hiragana_url = 'https://en.wikipedia.org/wiki/Hiragana'
dfh = pd.read_html(hiragana_url, match='Hiragana syllabograms', index_col=0)

In [39]:
dfh[0].head(n=10)

| | Monographs (gojūon): a | Monographs (gojūon): i | Monographs (gojūon): u | Monographs (gojūon): e | Monographs (gojūon): o | Digraphs (yōon): ya | Digraphs (yōon): yu | Digraphs (yōon): yo |
|---|---|---|---|---|---|---|---|---|
| ∅ | あ a [a] | い i [i] | う u [ɯ] | え e [e] | お o [o] | | | |
| K | か ka [ka] | き ki [ki] | く ku [kɯ] | け ke [ke] | こ ko [ko] | きゃ kya [kʲa] | きゅ kyu [kʲɯ] | きょ kyo [kʲo] |
| S | さ sa [sa] | し shi [ɕi] | す su [sɯ] | せ se [se] | そ so [so] | しゃ sha [ɕa] | しゅ shu [ɕɯ] | しょ sho [ɕo] |
| T | た ta [ta] | ち chi [tɕi] | つ tsu [tsɯ] | て te [te] | と to [to] | ちゃ cha [tɕa] | ちゅ chu [tɕɯ] | ちょ cho [tɕo] |
| N | な na [na] | に ni [ɲi] | ぬ nu [nɯ] | ね ne [ne] | の no [no] | にゃ nya [ɲa] | にゅ nyu [ɲɯ] | にょ nyo [ɲo] |
| H | は ha [ha] ([ɰa] as particle) | ひ hi [çi] | ふ fu [ɸɯ] | へ he [he] ([e] as particle) | ほ ho [ho] | ひゃ hya [ça] | ひゅ hyu [çɯ] | ひょ hyo [ço] |
| M | ま ma [ma] | み mi [mi] | む mu [mɯ] | め me [me] | も mo [mo] | みゃ mya [mʲa] | みゅ myu [mʲɯ] | みょ myo [mʲo] |
| Y | や ya [ja] | [6] | ゆ yu [jɯ] | [6] | よ yo [jo] | | | |
| R | ら ra [ɾa] | り ri [ɾi] | る ru [ɾɯ] | れ re [ɾe] | ろ ro [ɾo] | りゃ rya [ɾʲa] | りゅ ryu [ɾʲɯ] | りょ ryo [ɾʲo] |
| W | わ wa [ɰa] | ( ゐ )wi [i] | [6] | ( ゑ )we [e] | をwo [o] | | | |


**Pickling** <br>
Pickling is the process whereby a Python object hierarchy is converted into a byte stream, and unpickling is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.<br><br>
Pandas objects such as DataFrame and Series provide a `to_pickle()` method, which uses Python's `pickle` module to save data structures to disk in the pickle format.
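
A minimal round-trip sketch, assuming a small made-up DataFrame and a temporary directory (the file name `frame.pkl` is arbitrary): `to_pickle` writes the object to disk and `pd.read_pickle` restores it.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    df.to_pickle(path)                 # serialize the DataFrame
    restored = pd.read_pickle(path)    # load it back

print(restored.equals(df))             # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.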