# Numpy
NumPy is an open source Python library that supports mathematical, scientific, engineering, and data science programming. It provides fast mathematical and statistical operations and works especially well with multi-dimensional arrays and matrix multiplication.


NumPy is a library, not a programming language: it provides multi-dimensional arrays and matrices together with a large number of mathematical operations on them. In this part, we will review the essential functions that you need to know for this tutorial on NumPy.


The library’s name is actually short for "Numeric Python" or "Numerical Python".

# Why NumPy ?
NumPy is memory efficient, meaning it can handle large amounts of data more compactly than plain Python lists. Besides, NumPy is very convenient to work with, especially for matrix multiplication and reshaping. On top of that, NumPy is fast. In fact, scikit-learn (a machine learning library) uses NumPy arrays for its matrix computations in the back end.

#### Import NumPy
Before you start using NumPy in your code, you have to import it.

In [4]:
import numpy as np

Most commonly, numpy is imported with the alias 'np'.

# Numpy Array
NumPy arrays are a bit like Python lists, but they differ in important ways. As the name suggests, the NumPy array is the central data structure of the numpy library.

#### Create a NumPy Array
The simplest way to create an array in NumPy is to start from a Python list.


In [6]:
myPythonList = [1,9,8,3]

To convert a Python list to a NumPy array, use the function <b> np.array </b>.

In [7]:
numpy_array_from_list = np.array(myPythonList)

To display the contents of the array:

In [8]:
print(numpy_array_from_list)

[1 9 8 3]


In [9]:
print(type(numpy_array_from_list))

<class 'numpy.ndarray'>


As we can see 'numpy_array_from_list' is now a numpy array.

In practice, there is no need to declare a Python List. The operation can be combined.

In [11]:
a  = np.array([1,9,8,3])
print(a)

[1 9 8 3]


#### Mathematical Operations on an Array
You can perform mathematical operations like addition, subtraction, division and multiplication on an array. The syntax is the array name followed by the operator (+, -, *, /) followed by the operand.

In [12]:
print(numpy_array_from_list + 10)

[11 19 18 13]


This operation adds 10 to each element of the numpy array.
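The same element-wise rule applies to the other operators and to two arrays of equal length. The following is a minimal sketch; the second array `b` is a made-up example, not part of the original notebook.

```python
import numpy as np

a = np.array([1, 9, 8, 3])
b = np.array([2, 3, 4, 1])  # hypothetical second array of the same length

scaled = a * 2   # multiply every element by a scalar
summed = a + b   # element-wise addition of two arrays
ratio = a / b    # element-wise true division, the result is a float array

print(scaled)  # [ 2 18 16  6]
print(summed)  # [ 3 12 12  4]
print(ratio.dtype)  # float64
```

Note that `/` always produces floats, even when both inputs are integer arrays.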

#### Shape of Array
You can check the shape of an array with the shape attribute, written after the array name (for example a.shape). In the same way, you can check the element type with dtype.

In [13]:
import numpy as np
a  = np.array([1,2,3])
print(a.shape)
print(a.dtype)

(3,)
int32


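An integer is a value without a decimal part. The exact integer type shown (int32 here) depends on the platform and the NumPy version. If you create an array that contains decimal values, the type will be float instead.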

In [14]:
#### Different type
b  = np.array([1.1,2.0,3.2])
print(b.dtype)

float64
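If you do not want to rely on the inferred type, you can request one explicitly. This is a small sketch using the `dtype` argument of `np.array` and the `astype` method; the variable names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])

# Convert an existing array to another type (returns a new array)
floats = ints.astype(np.float64)

# Request a type when the array is created
explicit = np.array([1, 2, 3], dtype=np.float32)

print(floats.dtype)    # float64
print(explicit.dtype)  # float32
```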


#### 2-Dimension Array
You can add a dimension by passing several sequences separated by a comma ",".
Note that they all have to be within the outer brackets [].


In [15]:
### 2 dimension
c = np.array([(1,2,3),(4,5,6)])
print(c.shape)

(2, 3)


Similarly, higher dimensions arrays can also be created.
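As a sketch of the higher-dimensional case, nesting one more level gives a 3-dimensional array. The values below are arbitrary and chosen only for illustration.

```python
import numpy as np

# Two blocks, each with 2 rows and 3 columns
d = np.array([
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
])

print(d.shape)  # (2, 2, 3)
print(d.ndim)   # 3
```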

#### np.zeros and np.ones
You can create a matrix full of zeros or ones with np.zeros and np.ones respectively. These are useful, for example, when you initialize weights at the start of machine learning training or in other statistical tasks.

In [19]:
print(np.zeros((2,2)))

[[0. 0.]
 [0. 0.]]


In [20]:
import numpy as np
print(np.ones((1,2,3)))

[[[1. 1. 1.]
  [1. 1. 1.]]]
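Both functions return floats by default, as the trailing dots in the output show. The sketch below, which is an addition to the original notebook, requests integers with the `dtype` argument and uses `np.full` for a constant other than 0 or 1.

```python
import numpy as np

# Integer zeros instead of the default float zeros
int_zeros = np.zeros((2, 3), dtype=int)

# A 2x2 array filled with an arbitrary constant
sevens = np.full((2, 2), 7)

print(int_zeros)
print(sevens)
```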


#### Reshape Data
On some occasions, you need to reshape the data, for example from wide to long. You can use the reshape function for this. The new shape must hold the same number of elements as the original.


In [22]:
import numpy as np
e  = np.array([(1,2,3), (4,5,6)])
print(e)

[[1 2 3]
 [4 5 6]]


In [24]:
print(e.reshape(3,2))

[[1 2]
 [3 4]
 [5 6]]
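When only one dimension is known, you can pass -1 for the other and let NumPy work it out from the number of elements. A minimal sketch with the same array `e`:

```python
import numpy as np

e = np.array([(1, 2, 3), (4, 5, 6)])

# -1 asks NumPy to infer that dimension from the total element count
long = e.reshape(-1, 1)   # one column, as many rows as needed
wide = e.reshape(1, -1)   # one row, as many columns as needed

print(long.shape)  # (6, 1)
print(wide.shape)  # (1, 6)
```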


#### Flatten Data
When you work with some machine learning algorithms, such as neural networks, you often need to flatten a multi-dimensional array into one dimension. You can use flatten() for this.


In [25]:
print(e.flatten())

[1 2 3 4 5 6]
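One detail worth knowing is that flatten() returns a copy, so changing the flattened array leaves the original untouched. The sketch below demonstrates this; the value 99 is arbitrary.

```python
import numpy as np

e = np.array([(1, 2, 3), (4, 5, 6)])

flat = e.flatten()
flat[0] = 99  # modify the flattened copy only

print(flat[0])   # 99
print(e[0, 0])   # 1, the original array is unchanged
```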


#### Random Numbers Generation
To generate random numbers for different distributions, we can use different random functions. 

In [38]:
## Generate random numbers from the standard normal (Gaussian) distribution: mean 0, standard deviation 1
standard_array = np.random.normal(0, 1, 10)
print(standard_array)

[-0.42961988 -1.92779115  0.55812445 -1.2276275   1.12331334  1.1420527
  0.0739283  -0.79199639  1.35543983 -0.3454129 ]


If plotted, the distribution will look similar to the following plot:
<img src = "numpy_tut1.png">

In [42]:
## Generate random numbers from a normal distribution
# first number denotes mean
# second number denotes standard deviation
# third number denotes size of the array
normal_array = np.random.normal(5, 0.5, 10)
print(normal_array)

[5.39297383 5.32995191 6.62958538 4.63684064 4.85598547 5.19460427
 4.77667286 5.05480573 5.47624144 4.58568941]


In [43]:
## Generate random numbers from a uniform distribution
# first number denotes low
# second number denotes high
# third number denotes size of the array
uniform_array = np.random.uniform(0,1,10)
print(uniform_array)

[0.29985818 0.0153101  0.22345268 0.69973202 0.48216496 0.82479835
 0.43843043 0.59733783 0.92008554 0.38713758]
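The numbers above change on every run. If you need reproducible results, one option is a seeded generator created with `np.random.default_rng`. This is a sketch of that alternative, not the approach used in the cells above; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

first = rng_a.normal(5, 0.5, 10)    # mean, standard deviation, size
second = rng_b.normal(5, 0.5, 10)

print(np.array_equal(first, second))  # True
```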


#### numpy.arange() 
Sometimes, you want to create values that are evenly spaced within a defined interval. For instance, to create the values from 1 to 10, you can use the numpy.arange() function.

In [47]:
# first number denotes Start of interval
# second number denotes End of interval (the end value itself is not included)
# third number, if present, denotes spacing between values. Default step is 1
print(np.arange(1, 11))

[ 1  2  3  4  5  6  7  8  9 10]


If you want to change the step, you can add a third number inside the parentheses.

In [48]:
print(np.arange(1, 14, 4))

[ 1  5  9 13]


#### numpy.linspace()
linspace returns a fixed number of evenly spaced samples over an interval.

In [49]:
# Start: Starting value of the sequence
# Stop: End value of the sequence
# Num: Number of samples to generate. Default is 50
# Endpoint: If True (default), stop is the last value. If False, stop value is not included.
np.linspace(1.0, 5.0, num=10)


array([1.        , 1.44444444, 1.88888889, 2.33333333, 2.77777778,
       3.22222222, 3.66666667, 4.11111111, 4.55555556, 5.        ])
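The difference from arange is that you choose the number of samples rather than the step. The following sketch contrasts the two and shows the `endpoint` argument; the interval 0 to 1 is just an example.

```python
import numpy as np

# linspace: choose how many samples; the stop value is included by default
with_stop = np.linspace(0.0, 1.0, num=5)        # 0, 0.25, 0.5, 0.75, 1

# endpoint=False leaves the stop value out
without_stop = np.linspace(0.0, 1.0, num=5, endpoint=False)  # 0, 0.2, 0.4, 0.6, 0.8

# arange: choose the step; the stop value is never included
stepped = np.arange(0.0, 1.0, 0.25)             # 0, 0.25, 0.5, 0.75

print(with_stop)
print(without_stop)
print(stepped)
```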

#### Indexing and Slicing NumPy Arrays 
Slicing data is straightforward with NumPy. We will slice the matrix "e". Note that, in Python, you use square brackets to select rows or columns.

In [50]:
import numpy as np
e  = np.array([(1,2,3), (4,5,6)])
print(e)

[[1 2 3]
 [4 5 6]]


In [51]:
## First row
print('First row:', e[0])

First row: [1 2 3]


In [52]:
## Second row
print('Second row:', e[1])

Second row: [4 5 6]


Inside the brackets, the value before the comma selects rows and the value after the comma selects columns.
If you want to select a column, put : before the comma, followed by the column index.
: means you want all the rows from the selected column.

In [53]:
print('Second column:', e[:,1])

Second column: [2 5]


To return the first two values of the second row, use :2 to select all columns up to, but not including, index 2.

In [54]:
## Second Row, two values
print(e[1, :2])

[4 5]
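A few more indexing forms follow the same row-then-column pattern. The sketch below adds single-element access, negative indices, and a boolean mask; the threshold 3 is arbitrary.

```python
import numpy as np

e = np.array([(1, 2, 3), (4, 5, 6)])

print(e[0, 2])    # 3, first row and third column
print(e[-1])      # [4 5 6], negative indices count from the end
print(e[:, -1])   # [3 6], last column of every row
print(e[e > 3])   # [4 5 6], boolean mask keeps elements greater than 3
```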


#### NumPy Statistical Functions 
NumPy has quite a few useful statistical functions for finding the minimum, maximum, mean, median, percentiles, standard deviation, variance, and so on of the elements in an array.


In [55]:
normal_array = np.random.normal(5, 0.5, 10)
print(normal_array)

### Min 
print(np.min(normal_array))

### Max 
print(np.max(normal_array))

### Mean 
print(np.mean(normal_array))

### Median
print(np.median(normal_array))

### Sd
print(np.std(normal_array))


[5.15365594 4.61426302 4.42788193 6.16715473 4.27644805 4.42518478
 5.45294927 4.55114002 5.33408259 4.32080543]
4.276448045180074
6.167154725567189
4.872356574230242
4.582701516789127
0.5943516378347707
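On a 2-dimensional array, these functions also accept an `axis` argument to compute one result per column or per row. A small sketch on the matrix `e` from the slicing section, using fixed values so the results are reproducible:

```python
import numpy as np

e = np.array([(1, 2, 3), (4, 5, 6)])

overall = np.mean(e)           # mean of all six values
per_column = np.mean(e, axis=0)  # collapse the rows: one mean per column
per_row = np.mean(e, axis=1)     # collapse the columns: one mean per row
median_pct = np.percentile(e, 50)  # the 50th percentile equals the median

print(overall)      # 3.5
print(per_column)   # [2.5 3.5 4.5]
print(per_row)      # [2. 5.]
print(median_pct)   # 3.5
```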


# Pandas
Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. 

Pandas deals with the following data structures −

* Series
* DataFrame



#### Series
Series is a one-dimensional array like structure with homogeneous data. For example, the following series is a collection of integers 10, 23, 56, …

In [73]:
import pandas as pd
my_series = pd.Series([10,23,56,78])
print(my_series)

0    10
1    23
2    56
3    78
dtype: int64


In [74]:
print(type(my_series))

<class 'pandas.core.series.Series'>


A Series can hold any data type, such as integer, float, or string. It is useful when you want to perform a computation on, or return, a one-dimensional array. A Series, by definition, cannot have multiple columns.

You can label the rows with the index argument. Its length must equal the number of values in the Series.

In [79]:
my_series = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
print(my_series)

a    1.0
b    2.0
c    3.0
dtype: float64
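Once the rows are labeled, values can be looked up by label as well as by position. A minimal sketch using the same Series:

```python
import pandas as pd

my_series = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])

by_label = my_series['b']            # look up by index label
by_position = my_series.iloc[0]      # look up by integer position
subset = my_series[['a', 'c']]       # a list of labels returns a smaller Series

print(by_label)         # 2.0
print(by_position)      # 1.0
print(subset.tolist())  # [1.0, 3.0]
```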


### Pandas Data Frames

They are defined as two-dimensional labeled data structures with columns of potentially different types.

The Pandas DataFrame consists of three main components: the data, the index, and the columns.

The Pandas library is usually imported under the alias pd.


In [56]:
import pandas as pd

#### Create Data frame
You can convert a NumPy array (or a nested Python list) to a pandas DataFrame with pd.DataFrame(). The opposite is also possible: to convert a pandas DataFrame to an array, you can use np.array().


In [80]:
## Nested list to pandas (a NumPy array works the same way)
h = [[1,2],[3,4]] 
df_h = pd.DataFrame(h)
print('Data Frame:', df_h)

Data Frame:    0  1
0  1  2
1  3  4


In [81]:
## Pandas to numpy
df_h_n = np.array(df_h)
print('Numpy array:', df_h_n)

Numpy array: [[1 2]
 [3 4]]


You can also use a dictionary to create a Pandas dataframe.

In [82]:
dic = {'Name': ["John", "Smith"], 'Age': [30, 40]}
pd.DataFrame(data=dic)

    Name  Age
0   John   30
1  Smith   40
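The three components mentioned earlier (data, index, columns) can be inspected directly on such a DataFrame. The sketch below reuses the dictionary from the cell above; the variable name `df` is illustrative.

```python
import pandas as pd

dic = {'Name': ["John", "Smith"], 'Age': [30, 40]}
df = pd.DataFrame(data=dic)

print(df.shape)          # (2, 2): two rows, two columns
print(list(df.columns))  # ['Name', 'Age']
print(list(df.index))    # [0, 1], the default row labels
print(df['Age'].mean())  # 35.0
```

The dictionary keys become the column names, and each list becomes one column.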


### Importing Data with read_csv()
The first step to any data science project is to import your data. Often, you'll work with data in Comma Separated Value (CSV) files and run into problems at the very start of your workflow. 



Before you can use pandas to import your data, you need to know where your data is in your filesystem and what your current working directory is.

# OS library
Python lets the developer use OS-dependent functionality through the module <b>os</b>. This module abstracts the platform's functionality and provides Python functions to navigate, create, delete and modify files and folders.


In [57]:
import os

#### getcwd()
Now, using the getcwd method, we can retrieve the path of the current working directory.

In [58]:
print(os.getcwd())

C:\Users\ab275\Downloads\edureka_freelance\gaurav ipynb\python1


Let's list the folders/files in the current directory using listdir:

In [59]:
print(os.listdir())

['.ipynb_checkpoints', '01_initial_notebook_screen.cb2ea87d9679.png', 'Basics of Python Part 1.ipynb', 'Basics of Python Part 2.ipynb', 'Basics of Python Part 3.ipynb', 'Basics of Python Part 4.ipynb', 'calculation.py', 'continue-statement-flowchart.jpg', 'flowchart-break-statement.jpg', 'forLoop.jpg', 'hello.png', 'numpy_tut1.png', 'osx-install-destination.png', 'osx-install-success.png', 'osx-install-type.png', 'reduce_diagram.png', 'test.txt', 'whileLoopFlowchart.jpg', 'win-install-complete.png']


Let's change the working directory and enter into the directory of data.

In [61]:
os.chdir('Data/')
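Changing the working directory affects every relative path used afterwards, so it can help to remember the original location and restore it. The following is a self-contained sketch under stated assumptions: it creates a temporary folder named "Data" (a stand-in for the real data directory) instead of relying on one existing on your machine.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build the path in a platform-independent way
    data_dir = os.path.join(tmp, "Data")  # hypothetical folder for this sketch
    os.makedirs(data_dir)

    original = os.getcwd()   # remember where we started
    os.chdir(data_dir)       # move into the data folder
    try:
        inside = os.getcwd()
        print(os.path.basename(inside))  # Data
    finally:
        os.chdir(original)   # always go back, even if an error occurred
```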

#### Loading your data
Now that you know what your current working directory is and where the dataset is in your filesystem, you can specify the file path to it. You're now ready to import the CSV file into Python using read_csv() from pandas:


In [83]:
credit_data = pd.read_csv("Credit_Pay.csv")

In the code above, the file path is the main argument to read_csv(), and it was specified as a relative file path. read_csv() accepts both absolute and relative paths (relative paths are resolved against the current working directory) and returns the contents of the flat file as a DataFrame.
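If you do not have Credit_Pay.csv at hand, you can still try read_csv() on in-memory text. The sketch below is an assumption-laden stand-in: the three rows are made up for illustration and only borrow a few column names from the dataset; `usecols` and `nrows` are optional read_csv() arguments for loading part of a file.

```python
import io
import pandas as pd

# Illustrative CSV text, not the real Credit_Pay.csv contents
csv_text = """ID,LIMIT_BAL,SEX
1,20000.0,Female
2,120000.0,Female
3,50000.0,Male
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 3)

# Load only some columns and only the first rows
partial = pd.read_csv(io.StringIO(csv_text), usecols=['ID', 'SEX'], nrows=2)
print(partial.shape)  # (2, 2)
```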

# Inspecting data

* Load the data
* Using head function - to print first few rows of the dataframe
* Using tail function - to print last few rows of the dataframe
* Using describe function - to get the counts, mean, std, min, max and percentile of the dataframe
* Using info function - to see the datatypes of the columns of the dataframe

The data has already been loaded with read_csv() into the object "credit_data".

In [85]:
print(credit_data.head())

   ID  LIMIT_BAL     SEX      EDUCATION MARRIAGE  AGE  REPAYMENT_JUL  \
0   1    20000.0  Female  Post Graduate  Married   24              2   
1   2   120000.0  Female  Post Graduate   Single   26             -1   
2   3    90000.0  Female  Post Graduate   Single   34              0   
3   4    50000.0  Female  Post Graduate  Married   37              0   
4   5    50000.0    Male  Post Graduate  Married   57             -1   

   REPAYMENT_JUN  REPAYMENT_MAY  REPAYMENT_APR  ...  BILL_AMT_MAY  \
0              2             -1             -1  ...         689.0   
1              2              0              0  ...        2682.0   
2              0              0              0  ...       13559.0   
3              0              0              0  ...       49291.0   
4              0             -1              0  ...       35835.0   

   BILL_AMT_APR  BILL_AMT_MAR  BILL_AMT_FEB  PAY_AMT_JUL  PAY_AMT_JUN  \
0           0.0           0.0           0.0          0.0        689.0   
1     

In [86]:
print(credit_data.tail())

          ID  LIMIT_BAL   SEX      EDUCATION MARRIAGE  AGE  REPAYMENT_JUL  \
29995  29996   220000.0  Male    High School  Married   39              0   
29996  29997   150000.0  Male    High School   Single   43             -1   
29997  29998    30000.0  Male  Post Graduate   Single   37              4   
29998  29999    80000.0  Male    High School  Married   41              1   
29999  30000    50000.0  Male  Post Graduate  Married   46              0   

       REPAYMENT_JUN  REPAYMENT_MAY  REPAYMENT_APR  ...  BILL_AMT_MAY  \
29995              0              0              0  ...      208365.0   
29996             -1             -1             -1  ...        3502.0   
29997              3              2             -1  ...        2758.0   
29998             -1              0              0  ...       76304.0   
29999              0              0              0  ...       49764.0   

       BILL_AMT_APR  BILL_AMT_MAR  BILL_AMT_FEB  PAY_AMT_JUL  PAY_AMT_JUN  \
29995       88004.0  

In [87]:
print(credit_data.describe())

                 ID       LIMIT_BAL           AGE  REPAYMENT_JUL  \
count  30000.000000    30000.000000  30000.000000   30000.000000   
mean   15000.500000   167484.322667     35.485500      -0.016700   
std     8660.398374   129747.661567      9.217904       1.123802   
min        1.000000    10000.000000     21.000000      -2.000000   
25%     7500.750000    50000.000000     28.000000      -1.000000   
50%    15000.500000   140000.000000     34.000000       0.000000   
75%    22500.250000   240000.000000     41.000000       0.000000   
max    30000.000000  1000000.000000     79.000000       8.000000   

       REPAYMENT_JUN  REPAYMENT_MAY  REPAYMENT_APR  REPAYMENT_MAR  \
count   30000.000000   30000.000000   30000.000000   30000.000000   
mean       -0.133767      -0.166200      -0.220667      -0.266200   
std         1.197186       1.196868       1.169139       1.133187   
min        -2.000000      -2.000000      -2.000000      -2.000000   
25%        -1.000000      -1.000000      -

In [88]:
print(credit_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
ID               30000 non-null int64
LIMIT_BAL        30000 non-null float64
SEX              30000 non-null object
EDUCATION        30000 non-null object
MARRIAGE         30000 non-null object
AGE              30000 non-null int64
REPAYMENT_JUL    30000 non-null int64
REPAYMENT_JUN    30000 non-null int64
REPAYMENT_MAY    30000 non-null int64
REPAYMENT_APR    30000 non-null int64
REPAYMENT_MAR    30000 non-null int64
REPAYMENT_FEB    30000 non-null int64
BILL_AMT_JUL     30000 non-null float64
BILL_AMT_JUN     30000 non-null float64
BILL_AMT_MAY     30000 non-null float64
BILL_AMT_APR     30000 non-null float64
BILL_AMT_MAR     30000 non-null float64
BILL_AMT_FEB     30000 non-null float64
PAY_AMT_JUL      30000 non-null float64
PAY_AMT_JUN      30000 non-null float64
PAY_AMT_MAY      30000 non-null float64
PAY_AMT_APR      30000 non-null float64
PAY_AMT_MAR      30000 non-nul

In [90]:
# To see if any column has NaN or Null values
print(credit_data.isnull().sum())

ID               0
LIMIT_BAL        0
SEX              0
EDUCATION        0
MARRIAGE         0
AGE              0
REPAYMENT_JUL    0
REPAYMENT_JUN    0
REPAYMENT_MAY    0
REPAYMENT_APR    0
REPAYMENT_MAR    0
REPAYMENT_FEB    0
BILL_AMT_JUL     0
BILL_AMT_JUN     0
BILL_AMT_MAY     0
BILL_AMT_APR     0
BILL_AMT_MAR     0
BILL_AMT_FEB     0
PAY_AMT_JUL      0
PAY_AMT_JUN      0
PAY_AMT_MAY      0
PAY_AMT_APR      0
PAY_AMT_MAR      0
PAY_AMT_FEB      0
dtype: int64


WooHoo! Our data is completely free of NaNs and Null values.
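For comparison, here is what the same check looks like when values are missing. The tiny DataFrame `toy` is invented for this sketch, and dropna() is shown as one possible way to handle the gaps, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
toy = pd.DataFrame({
    'AGE': [24, np.nan, 34],
    'SEX': ['Female', 'Male', None],
})

missing = toy.isnull().sum()   # count of missing values per column
print(missing['AGE'])  # 1
print(missing['SEX'])  # 1

cleaned = toy.dropna()         # keep only rows without any missing value
print(len(cleaned))    # 1
```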

#### Slice data
You can use the column name to extract data in a particular column.

In [91]:
### Using name
credit_data['EDUCATION']

0        Post Graduate
1        Post Graduate
2        Post Graduate
3        Post Graduate
4        Post Graduate
             ...      
29995      High School
29996      High School
29997    Post Graduate
29998      High School
29999    Post Graduate
Name: EDUCATION, Length: 30000, dtype: object

To select multiple columns, you need to use two pairs of brackets, [[..,..]]

The outer pair of brackets is the selection operator; the inner pair is a Python list naming the columns you want to return.

In [92]:
print(credit_data[['EDUCATION','MARRIAGE']])

           EDUCATION MARRIAGE
0      Post Graduate  Married
1      Post Graduate   Single
2      Post Graduate   Single
3      Post Graduate  Married
4      Post Graduate  Married
...              ...      ...
29995    High School  Married
29996    High School   Single
29997  Post Graduate   Single
29998    High School  Married
29999  Post Graduate  Married

[30000 rows x 2 columns]
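The bracket style also determines the type of the result: single brackets return a Series, double brackets return a DataFrame. A minimal sketch on a made-up two-row DataFrame that borrows the column names used above:

```python
import pandas as pd

# Illustrative rows, not taken from the real dataset
toy = pd.DataFrame({
    'EDUCATION': ['Post Graduate', 'High School'],
    'MARRIAGE': ['Married', 'Single'],
    'AGE': [24, 39],
})

one = toy['EDUCATION']                 # single brackets: one column as a Series
two = toy[['EDUCATION', 'MARRIAGE']]   # double brackets: a DataFrame

print(type(one).__name__)  # Series
print(type(two).__name__)  # DataFrame
print(two.shape)           # (2, 2)
```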


The code below returns the first three rows.