## Pandas Introduction

pandas is a software library for the Python programming language, used for data manipulation and analysis.

# Key Features of Pandas:

1) Fast and efficient Data Frame object with default and customized indexing.

2) Tools for loading data into in-memory data objects from different file formats.

3) Data alignment and integrated handling of missing data.

4) Reshaping of data sets.

5) Label-based slicing, indexing and sub-setting of large data sets.

6) Columns from a data structure can be deleted or inserted.

7) Group by data for aggregation and transformations.

8) High performance merging and joining of data.

## How Things Work With Pandas

To understand a dataset properly, we will work through the following steps with pandas:

1. Import Libraries
2. How to create Series and Dataframes
3. Importing Data
4. Data Validation
5. Data Selection
6. Handle Missing Values
7. Grouping
8. Concatenation
9. Merging
10. Reshaping

# How to import pandas in the program

import pandas

### Here we bind the pandas library to the alias pd

In [1]:
import pandas as pd

#### Command to install pandas

#### pip install pandas

#### Import Pandas Library, So what is a library?

In [2]:
# We give pandas the nickname pd so that it is quicker to type

import pandas as pd
import numpy as np
from pandas import Series, DataFrame # This saves us from typing 'pd.Series' and 'pd.DataFrame' each time

# In pandas, the main things to remember are:

1) Series

2) Data Frame

3) Loading Data from different file formats

4) Data Manipulation

# 2) Data Frame

A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

### Features of DataFrame

    1) Potentially columns are of different types

    2) Size – Mutable

    3) Labeled axes (rows and columns)

    4) Can Perform Arithmetic operations on rows and columns
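The four features listed above can be seen in a small sketch. The frame `people`, its column names, and its values are made up for illustration and do not come from the notebook's data.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type.
people = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],   # strings
        "age": [31, 45, 27],               # integers
        "height_m": [1.62, 1.80, 1.75],    # floats
    },
    index=["r1", "r2", "r3"],              # labeled row axis
)

# Size-mutable: a new column can be added to the existing frame.
people["age_in_months"] = people["age"] * 12

# Arithmetic works on whole columns at once.
mean_age = people["age"].mean()

print(people.dtypes)
print(mean_age)
```

Row labels (`r1`, `r2`, `r3`) and column labels can both be used for selection later on.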


# How to create Data Frame by using variable

### Storing a dictionary in a variable

In [3]:
import pandas as pd

In [4]:
web={'Day' : [1,2,3,4,5,6], 'visitors':[1000,2000,3000,2504,5125,1256]}
web

{'Day': [1, 2, 3, 4, 5, 6], 'visitors': [1000, 2000, 3000, 2504, 5125, 1256]}

### Converting the dict into a DataFrame with pandas

In [5]:
df=pd.DataFrame(web)
df

Unnamed: 0,Day,visitors
0,1,1000
1,2,2000
2,3,3000
3,4,2504
4,5,5125
5,6,1256


# Pandas Operations

1) Slicing the DataFrame

2) Changing the Index

3) Data conversion

4) Joining and Merging

5) Concatenation

6) Changing the column headers

# 1) Slicing

In [6]:
import pandas as pd

In [7]:
Bank={'Year': [2001,2002,2003,2004,2005,2006,2007], 'Int_rate':[2,3,2,2,1,3,6], 'US_GDP_Thousands': [50,55,65,55,60,65,54]}

In [8]:
df = pd.DataFrame(Bank)
df

Unnamed: 0,Year,Int_rate,US_GDP_Thousands
0,2001,2,50
1,2002,3,55
2,2003,2,65
3,2004,2,55
4,2005,1,60
5,2006,3,65
6,2007,6,54


In [9]:
df.shape

(7, 3)

# Head
By default, the head method displays the first five rows of the DataFrame.

In [10]:
# By default, head() returns the first 5 rows

df.head()

Unnamed: 0,Year,Int_rate,US_GDP_Thousands
0,2001,2,50
1,2002,3,55
2,2003,2,65
3,2004,2,55
4,2005,1,60


In [11]:
# To get only the first 3 rows, pass 3 to head

df.head(3)

Unnamed: 0,Year,Int_rate,US_GDP_Thousands
0,2001,2,50
1,2002,3,55
2,2003,2,65


# Tail
By default, the tail method displays the last five rows of the DataFrame.

In [12]:
# To print the last rows, use tail
df.tail()

Unnamed: 0,Year,Int_rate,US_GDP_Thousands
2,2003,2,65
3,2004,2,55
4,2005,1,60
5,2006,3,65
6,2007,6,54


In [13]:
# To get only the last 3 rows, pass 3 to tail

df.tail(3)

Unnamed: 0,Year,Int_rate,US_GDP_Thousands
4,2005,1,60
5,2006,3,65
6,2007,6,54


# Merging

In [14]:
# Here we are going to take two data frames

df1=pd.DataFrame({'HPI':[80,90,70,60],'rate':[2,1,2,3],'GDP':[50,55,65,55]})
df1

Unnamed: 0,HPI,rate,GDP
0,80,2,50
1,90,1,55
2,70,2,65
3,60,3,55


In [15]:
df2=pd.DataFrame({'HPI': [50,70,90,60], 'rate': [2,2,1,3], 'GDN': [50,65,55,54]})
df2

Unnamed: 0,HPI,rate,GDN
0,50,2,50
1,70,2,65
2,90,1,55
3,60,3,54


In [16]:
# Here I am going to merge the two data frames

df3=pd.merge(df1,df2) # with no 'on' argument, merge joins on all shared columns (HPI and rate) and keeps only matching rows
df3

Unnamed: 0,HPI,rate,GDP,GDN
0,90,1,55,55
1,70,2,65,65
2,60,3,55,54


In [17]:
df5=pd.merge(df1,df2, on="rate") # join on the rate column only; the overlapping HPI columns get _x and _y suffixes
df5

Unnamed: 0,HPI_x,rate,GDP,HPI_y,GDN
0,80,2,50,50,50
1,80,2,50,70,65
2,70,2,65,50,50
3,70,2,65,70,65
4,90,1,55,90,55
5,60,3,55,60,54


In [18]:
df1

Unnamed: 0,HPI,rate,GDP
0,80,2,50
1,90,1,55
2,70,2,65
3,60,3,55


In [19]:
df2

Unnamed: 0,HPI,rate,GDN
0,50,2,50
1,70,2,65
2,90,1,55
3,60,3,54


# Joining

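join() combines DataFrames based on their index values.

By default it is a left join: every index label of the first DataFrame is kept, and the rows of the second DataFrame with matching index labels are added alongside.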
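As a minimal sketch of index-based joining, the example below uses two small made-up frames (`left` and `right` are illustrative names) whose index labels only partly overlap.

```python
import pandas as pd

left = pd.DataFrame({"HPI": [80, 90, 70]}, index=["a", "b", "c"])
right = pd.DataFrame({"CPI": [1, 2]}, index=["b", "d"])

# Default how="left": keep every index label of `left`;
# labels missing from `right` get NaN in the added columns.
left_join = left.join(right)

# how="inner": keep only labels present in both frames.
inner_join = left.join(right, how="inner")

print(left_join)
print(inner_join)
```

Label `d` exists only in `right`, so it appears in neither result.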

In [20]:
df6=pd.DataFrame({'HPI':[80,90,70,60], 'rate':[2,1,2,3]})
df6

Unnamed: 0,HPI,rate
0,80,2
1,90,1
2,70,2
3,60,3


In [21]:
df7=pd.DataFrame({'CPI':[80,90,70],'Years':[2,1,5]})
df7

Unnamed: 0,CPI,Years
0,80,2
1,90,1
2,70,5


In [22]:
df8=df6.join(df7)
df8

Unnamed: 0,HPI,rate,CPI,Years
0,80,2,80.0,2.0
1,90,1,90.0,1.0
2,70,2,70.0,5.0
3,60,3,,


# 2) Changing the Index & Column Headers

# Changing the Index

In [23]:
df12=pd.DataFrame({'Day': [1,2,3,4,5,6], 'visitors': [1000,2000,3000,2504,5125,1256], 'Bounce_Rate': [10,20,30,14,50,6]})
df12

Unnamed: 0,Day,visitors,Bounce_Rate
0,1,1000,10
1,2,2000,20
2,3,3000,30
3,4,2504,14
4,5,5125,50
5,6,1256,6


In [24]:
df12.set_index("Day", inplace=True)
df12

Unnamed: 0_level_0,visitors,Bounce_Rate
Day,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1000,10
2,2000,20
3,3000,30
4,2504,14
5,5125,50
6,1256,6


In [25]:
df12.set_index("Day") # 'Day' is already the index after the in-place call above, so this raises KeyError

KeyError: "None of ['Day'] are in the columns"
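The KeyError appears because a column that has been moved into the index is no longer among the columns. A small sketch with a made-up frame (`visits` is a hypothetical name) shows this, along with `reset_index` as one way to move the index back.

```python
import pandas as pd

visits = pd.DataFrame({"Day": [1, 2, 3], "visitors": [1000, 2000, 3000]})

# Without inplace=True, set_index returns a new frame and leaves `visits` alone.
by_day = visits.set_index("Day")

# "Day" is now the index of `by_day`, not a column, so a second
# set_index("Day") on it would raise KeyError.
day_is_column = "Day" in by_day.columns

# reset_index moves the index back into an ordinary column.
restored = by_day.reset_index()

print(day_is_column)
print(restored)
```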

In [None]:
df12.set_index("visitors") # returns a new DataFrame; without inplace=True the original is unchanged

In [None]:
df12

In [None]:
df13=df12.set_index("visitors")
df13

# Changing the column Name

In [None]:
df12=pd.DataFrame({'Day': [1,2,3,4,5,6], 'visitors': [1000,2000,3000,2504,5125,1256], 'Bounce_Rate': [10,20,30,14,50,6]})
df12

In [None]:
df13=df12.rename(columns={"Day":"No of Days"})
df13

In [None]:
df12

# 3) Concatenation

In [None]:
df14=pd.DataFrame({'HPI': [80,90,70,60], 'Int_rates': [2,1,2,3]}, index=[2001,2002,2004,2005])
df14

In [None]:
df15=pd.DataFrame({'HPI': [80,90,70,60], 'Years': [2,1,2,3], 'Int_rate': [2,1,2,3]}, index=[2002,2006,2007,2008])
df15

In [None]:
df17=pd.concat([df15,df14], sort=True)
df17

# How to load (read) a file in pandas

In [None]:
df18=pd.read_csv("D:/C DRIVE-SSD DATA backup 15-12-2020/Downloads/Data.csv")
df18

# Converting one file format to another

In [None]:
df18.to_html('demo14.html')

In [None]:
df18.max()

In [None]:
df18.min()

In [None]:
df18.count()

In [None]:
df18['Age'].max()

In [None]:
df18['Age'].min()

In [None]:
df18['Age'].count()

In [None]:
df18

In [None]:
df19=df18['Country']=='Spain'
df19

In [None]:
df18

In [None]:
df18['Country'][df18['Purchased']=='Yes']

In [None]:
df19=df18['Age'].mean()

In [None]:
df19

In [None]:
df20=df18['Age'].median()
df20

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8,4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

In [None]:
df

In [None]:
df.loc[['a','b','f','h'],['A','C']]

In [None]:
df.loc['a':'c']

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

df


In [None]:
df.iloc[:4]

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a','c','e','f','h'],columns=['one','two','three'])


In [None]:
df

In [None]:
df = df.reindex(['a','b','c','d','e','f','g','h'])

df

In [None]:
df['one'].isnull()

In [None]:
df.isnull()

In [None]:
df['one'].isnull().sum()

In [None]:
df.isnull().sum()

In [None]:
df.isnull().any()

In [None]:
df

In [None]:
df.fillna(1)

In [None]:
df

In [None]:
df.ffill()

In [None]:
df

In [None]:
df.bfill()

# Row wise

In [None]:
df

In [None]:
df1 = df.dropna()
df1

# Column Wise

In [None]:
df

In [None]:
df2 = df.dropna(axis=1)

In [None]:
df2

# Replace

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})
df

In [None]:
df.replace({0:10,2000:60})

In [None]:
df.describe()

In [None]:
# We give pandas the nickname pd so that it is quicker to type
import pandas as pd
import numpy as np
from pandas import Series, DataFrame # This saves us from typing 'pd.Series' and 'pd.DataFrame' each time

### Create Series and Dataframe
- With the "Series" function in pandas we can create a series
- With the "DataFrame" function in pandas we can create a data frame

In [None]:
# Creating a Series (an array of data values and their index)
obj = Series([3,6,9,12])
obj

In [None]:
obj.values # It shows the values in a series

In [None]:
obj.index # It shows the index range

### Creating a Series with a named index

In [None]:
coins = Series([.01,.05,.10,.25], index=['penny','nickel','dime','quarter'])
coins

In [None]:
coins['dime']

In [None]:
coins[coins > 0.7] # no coin is worth more than 0.7, so this returns an empty Series

In [None]:
# Custom Index
fruits = ['apples', 'oranges', 'cherries', 'pears', 'Mango']
quantities = [10,20,30,40,50]
S = pd.Series(quantities, index=fruits)
S

In [None]:
S['Mango']

In [None]:
# Creating a DataFrame
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

In [None]:
# Constructing a DataFrame from a Dictionary

data = {'City':['SF','LA','NYC'],'Population':[837000,3880000,8400000]}
city_frame = DataFrame(data)

In [None]:
city_frame

In [None]:
# Creating a Data Frame by Passing Lists

data = [['Mahesh',40],['Pawan',50],['Prabhas',35]]
df = pd.DataFrame(data,columns=['Actor_Name','Age'])

In [None]:
print(df)

## 1. Reading

#### First, set path names

Keep commonly used directories as path strings in the code.

In [None]:
import os # OS module provides a way of using operating system dependent functionality
import pandas as pd
# os.getcwd()
os.chdir("F:/Class/My_Class/Class_Data/Pandas") # changes the current working directory to the given path.

### Importing Data

Below is the syntax to use if the directory was saved in a variable named path:

file1 = pd.read_csv(path +'file.csv') # Preferred type of loading data
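One way to build such a path without relying on string concatenation is `pathlib`. The sketch below is an illustration only: the temporary directory and the tiny `file.csv` it writes are stand-ins for a real data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data directory; a temporary folder stands in for a real one.
data_dir = Path(tempfile.mkdtemp())

# Write a tiny CSV so the example is self-contained.
csv_path = data_dir / "file.csv"
csv_path.write_text("x,y\n1,2\n3,4\n")

# Joining with `/` avoids forgetting the trailing slash that
# string concatenation (path + 'file.csv') depends on.
file1 = pd.read_csv(csv_path)
print(file1)
```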

#### Import a .csv file (Comma Separated Value Files)

In [None]:
csv1 = pd.read_csv("Iris.csv")

In [None]:
# or give the whole path name to the file and then load it
csv1 = pd.read_csv("D:/C DRIVE-SSD DATA backup 15-12-2020/Desktop/360digitmg material/Data Preprocessing/DataSets-Data Pre Processing/DataSets/iris.csv")

In [None]:
csv1.head(3)

### Different ways of Importing csv

- pd.read_csv("Iris.csv") # Loads the data into Python
- pd.read_csv("Iris.csv", skiprows=1) # Skips the first row
- pd.read_csv("Iris.csv", header=1) # Uses the second row (index 1) as the header and skips the row before it
- pd.read_csv("Iris.csv", header=None, names = ["ticker","eps","revenue","people"]) # Treats the file as having no header row and supplies column names
- pd.read_csv("Iris.csv", nrows=2) # Reads only the first 2 rows
- pd.read_csv("Iris.csv", na_values=["n.a.", "not available"]) # Tells pandas which strings to treat as missing (NA)
- pd.read_csv("Iris.csv", parse_dates=['day']) # Parses the 'day' column as dates instead of strings
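The options above can be tried without any file on disk by reading from an in-memory string. The CSV text and column names below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
raw = (
    "day,ticker,eps\n"
    "2020-01-01,AAA,1.5\n"
    "2020-01-02,BBB,n.a.\n"
    "2020-01-03,CCC,not available\n"
)

# na_values marks the listed strings as missing; parse_dates converts 'day'.
full = pd.read_csv(
    io.StringIO(raw),
    na_values=["n.a.", "not available"],
    parse_dates=["day"],
)

# nrows limits how many data rows are read.
first_two = pd.read_csv(io.StringIO(raw), nrows=2)

# header=None treats every line as data, so names supplies the labels;
# skiprows=1 then drops the original header line.
renamed = pd.read_csv(
    io.StringIO(raw), header=None, skiprows=1, names=["d", "t", "e"]
)

print(full.dtypes)
print(first_two)
print(renamed)
```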

### Importing: XLSX files
XLSX is a Microsoft Excel Open XML file format also known as Spreadsheet file format.

In [None]:
xml1 = pd.read_excel("D:/C DRIVE-SSD DATA backup 15-12-2020/Desktop/360digitmg material/Data Preprocessing/DataSets-Data Pre Processing/DataSets/Assignment_module02 (1).xlsx") # optionally pass sheet_name="Shahina"

In [None]:
xml1.head(15)

## Import HTML Files

- HTML stands for Hyper Text Markup Language.
- It is the standard markup language which is used for creating Web pages.
- HTML is used to describe structure of web pages using markup.
- To read HTML with pandas we need the beautifulsoup4 and html5lib packages installed

In [None]:
# We can import data using read_html

from pandas import read_html
import pandas as pd
from pandas import Series, DataFrame

# pip install html5lib
# pip install beautifulsoup4

In [None]:
# Let's grab a URL for the list of failed banks
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'

In [None]:
# Read the HTML tables on the page into a list of DataFrames
dframe_list = pd.read_html(url)

In [None]:
# Take the first table from the list as a DataFrame
dframe = dframe_list[0]

In [None]:
# Show
dframe

## 2. Data Validation

Let's import our Iris data for this.

## Viewing Data

- df.head(3) # Displays first 3 rows
- df.tail(3) # Displays last 3 rows
- df.columns # Names of the columns
- df.shape # Number of rows and columns
- df.values # Displays values of data

In [None]:
# Import Data (Iris Data)
from sklearn.datasets import load_iris
iris = load_iris()

In [None]:
# Load iris into a dataframe and set the field names
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df.head()

## To View Data

In [None]:
df.head()

In [None]:
df.tail(3)

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.values

In [None]:
df.describe()

In [None]:
# Anaconda PowerShell: pip install vega_datasets
from vega_datasets import data
data.list_datasets()

## Data Selection [ ]

- pandas offers a wide variety of options for subset selection.
- We will now discuss how to slice and dice subsets of pandas objects.
- For this we import the Titanic data, a famous dataset from Kaggle about predicting which passengers survived the sinking of the Titanic.

In [None]:
# Import Data
import pandas as pd
df = pd.read_csv("C:/Users/Admin/Downloads/Classes_Titanic_titanic3.csv")

In [None]:
df.head(3)

In [None]:
df[1:3]

In [None]:
df['sex'].head(10)

In [None]:
df[['sex','age']].head(3) # selecting two columns: pass a list of names, hence the double brackets

## Indexing with Value Selection

In [None]:
df[df['age']>32].head(3) # rows where age is greater than 32

In [None]:
df['survived'][df['age'] >= 32].head(3)

In [None]:
df['age'].max()

In [None]:
df[df['age'] == df['age'].max()] # all columns for the row(s) with the maximum age

In [None]:
df['name'][df['age'] == df['age'].max()]

## Pandas has different data access methods.

1. The indexing operator "[ ]" and the attribute operator "." give quick and easy access.
2. .loc[] is for label-based indexing (use it when you know the row or column labels).
3. .iloc[] is for integer position-based indexing (use it when you know the positions).
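The difference between label-based and position-based access is easiest to see when the index labels are not 0, 1, 2. The frame `scores` below is a made-up example for that purpose.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
scores = pd.DataFrame(
    {"math": [90, 75, 60], "art": [70, 85, 95]},
    index=[10, 20, 30],
)

by_label = scores.loc[20, "math"]      # row labeled 20, column "math"
by_position = scores.iloc[1, 0]        # second row, first column

# Slices differ: .loc includes the end label, .iloc excludes the end position.
loc_slice = scores.loc[10:20]          # labels 10 and 20
iloc_slice = scores.iloc[0:1]          # position 0 only

print(by_label, by_position)
print(len(loc_slice), len(iloc_slice))
```

Both single-cell lookups reach the same value here, but the two slices return different numbers of rows.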

#### Row Selection

In [None]:
print(df.iloc[0,1]) # iloc selects by integer position: first row, second column

In [None]:
df.iloc[1]

In [None]:
df.iloc[-1]

#### Column Selection

In [None]:
df.iloc[:,0]

In [None]:
df.iloc[:,1]

In [None]:
df.iloc[:,-1]

### Multiple Row and Column Selection

- df.iloc[0:5] # first five rows of dataframe
- df.iloc[:, 0:2] # first two columns of data frame with all rows
- df.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns
- df.iloc[0:5, 5:8] # first 5 rows and the 6th, 7th, 8th columns of the data frame (positions 5, 6, 7)

In [None]:
df.iloc[0:5]

In [None]:
df.iloc[:, 0:2]

In [None]:
df.iloc[[0,3,6,24], [0,5,6]] # list = [1,2,3,4,5]

In [None]:
df.iloc[0:5, 5:8]

In [None]:
# Now use the label-based loc indexer to select all rows for a specific column
df.loc[:,'ticket']

### Dropping or Deleting Data

###### numpy.arange(start, stop, step, dtype)
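A short sketch of how the `arange` arguments behave. The variable names are arbitrary and the values are chosen only for illustration.

```python
import numpy as np

a = np.arange(3)                 # stop only: starts at 0, stop is excluded
b = np.arange(2, 10, 3)          # start, stop, step
c = np.arange(0, 1, 0.25)        # a float step is allowed
d = np.arange(3, dtype=float)    # dtype forces the element type

print(a, b, c, d)
```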

In [None]:
#Create a new series
import numpy as np
ser1 = Series(np.arange(3),index=['a','b','c']) # arange(3) supplies the values 0, 1, 2

#Show
ser1

In [None]:
# Now let's drop an index
ser1.drop('b') # drop index of "b"

In [None]:
ser1 # drop returned a new Series; the original is unchanged

In [None]:
# With a DataFrame we can drop values from either axis (rows and columns)
dframe1 = DataFrame(np.arange(9).reshape((3,3)),index=['SF','LA','NY'],columns=['pop','size','year'])
dframe1

In [None]:
# dropping a row
dframe1.drop('LA', axis=0)

In [None]:
#Or we could drop a column
# Need to specify that axis is 1, not 0 where axis : {0 or 'index', 1 or 'columns'}

dframe1.drop('year', axis=1)

# Missing Value

In [None]:
# how to deal with missing data
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
data = Series(['one','two', np.nan, 'four'])

In [None]:
data

In [None]:
# Find the missing values
data.isnull() 

In [None]:
data.isnull().sum() # sum() counts the missing values; count() would return the total number of entries

In [None]:
dframe = DataFrame([[1,2,3],[np.nan,5,6],[7,np.nan,9],[np.nan,np.nan,np.nan]])

In [None]:
#Show
dframe

In [None]:
clean_dframe = dframe.dropna()

In [None]:
clean_dframe

In [None]:
# Drop rows in which every value is missing
dframe.dropna(how='all')

In [None]:
dframe.dropna(how='any')

In [None]:
# Drop columns with missing data
dframe.dropna(axis=1)

In [None]:
# We only want rows with at least 3 data points
dframe2 = DataFrame([[1,2,3,np.nan],[2,np.nan,5,6],[np.nan,7,np.nan,9],[1,np.nan,np.nan,np.nan]])

#Show
dframe2

In [None]:
dframe2.dropna(thresh=2) # keep only rows with at least 2 non-missing values

In [None]:
dframe2.dropna(thresh=3)
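The `thresh` argument is a minimum count of non-missing values a row needs in order to be kept. The sketch below rebuilds the same four rows under the illustrative name `frame` and counts the non-missing values per row to make that visible.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [
        [1, 2, 3, np.nan],            # 3 non-missing values
        [2, np.nan, 5, 6],            # 3 non-missing values
        [np.nan, 7, np.nan, 9],       # 2 non-missing values
        [1, np.nan, np.nan, np.nan],  # 1 non-missing value
    ]
)

# Number of non-missing values in each row.
non_missing = frame.notna().sum(axis=1)

at_least_2 = frame.dropna(thresh=2)   # drops only the last row
at_least_3 = frame.dropna(thresh=3)   # drops the last two rows

print(non_missing)
print(at_least_2)
print(at_least_3)
```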

In [None]:
dframe2.fillna(1)

In [None]:
#Fill different values for different columns
dframe2.fillna({0:0,1:1,2:2,3:3})

In [None]:
# Note that we still have access to the original dframe
dframe2

In [None]:
# To modify the existing object, use inplace
dframe2.fillna("varma",inplace=True)

In [None]:
# Now Let's see the dframe
dframe2

# GROUP-BY

In [None]:
import pandas as pd
df = pd.read_csv("D:/worldcitiespop.csv")

In [None]:
df 

In [None]:
g = df.groupby('City')
g

In [None]:
for city, city_df in g:
    print(city)
    print(city_df)

In [None]:
# or to get specified group
g.get_group('aixirivall')

In [None]:
# Find the maximum of each column within each city group
print(g.max())

In [None]:
print(g.mean(numeric_only=True)) # restrict to numeric columns; recent pandas raises an error on text columns

In [None]:
print(g.describe())

## Concatenate

##### Used to join two or more data frames

In [None]:
import pandas as pd
india_weather = pd.DataFrame({
         "city": ["hyderabad","vizag","banglore"],
          "temperature": [32,45,30],
          "humidity": [80, 60, 78]
})
india_weather

In [None]:
us_weather = pd.DataFrame({
    "city": ["new york","chicago","orlando"],
    "temperature": [21,14,35],
    "humidity": [68, 65, 75]
})
us_weather

In [None]:
# concatenate two dataframes
df = pd.concat([india_weather, us_weather])
df

In [None]:
# if you want continuous index
df = pd.concat([india_weather, us_weather], ignore_index=True)
df

In [None]:
df = pd.concat([india_weather, us_weather], axis=1)
df

## Merge DataFrames

pandas merge() is a single function that covers all standard database join operations between DataFrame objects.
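As a rough sketch of the standard join types, the example below runs the four `how` options on two small made-up frames. The names `temps` and `humid` are illustrative, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

temps = pd.DataFrame({"city": ["mumbai", "delhi", "hyderabad"],
                      "temperature": [32, 45, 40]})
humid = pd.DataFrame({"city": ["delhi", "mumbai", "banglore"],
                      "humidity": [68, 65, 75]})

inner = pd.merge(temps, humid, on="city", how="inner")   # keys in both frames
left = pd.merge(temps, humid, on="city", how="left")     # all keys from temps
right = pd.merge(temps, humid, on="city", how="right")   # all keys from humid
outer = pd.merge(temps, humid, on="city", how="outer", indicator=True)  # union

print(inner)
print(outer)
```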

In [None]:
temperature_df = pd.DataFrame({
"city": ["mumbai","delhi","banglore", 'hyderabad'],
"temperature": [32,45,30,40]})
temperature_df

In [None]:
humidity_df = pd.DataFrame({
"city": ["delhi","mumbai","banglore"],
"humidity": [68, 65, 75]})
humidity_df

In [None]:
# merge two dataframes on the shared 'city' column (inner join by default)
df = pd.merge(temperature_df, humidity_df, on='city')
df

In [None]:
# OUTER-JOIN
df = pd.merge(temperature_df, humidity_df, on='city', how='outer')
df

In [None]:
# Lastly, everything works similarly in DataFrames
dframe1 = DataFrame(np.random.randn(4,3), columns=['X','Y','Z'])
dframe2 = DataFrame(np.random.randn(3,3), columns=['Y','Q','X'])

In [None]:
dframe1

In [None]:
dframe2

In [None]:
# Concat on DataFrame
pd.concat([dframe1,dframe2])

In [None]:
pd.concat([dframe1,dframe2], ignore_index=True)

## Merging

For more info on merge parameters check out:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.merge.html

In [None]:
dframe1 = DataFrame({'key':['X','Z','Y','Z','X','X'], 'data_set1': np.arange(6)})
dframe1

In [None]:
# another DataFrame
dframe2 = DataFrame({'key':['Q','Y','Z'],'data_set_2':[1,2,3]})
dframe2

In [None]:
# Merge will automatically choose overlapping columns to merge on
pd.merge(dframe1,dframe2)

In [None]:
# specified which column to merge on
pd.merge(dframe1,dframe2,on='key')

In [None]:
# how='left' keeps every key from the left frame (dframe1)
pd.merge(dframe1,dframe2,on='key',how='left')

In [None]:
# how='right' keeps every key from the right frame (dframe2)
pd.merge(dframe1,dframe2,on='key',how='right')

In [None]:
# "outer" method selects the union of both keys
pd.merge(dframe1,dframe2,on='key',how='outer')

## Reshape

In [None]:
#Let's see how stack and unstack work

# Create DataFrame
dframe1 = DataFrame(np.arange(8).reshape((2, 4)),
                    index=pd.Index(['LA', 'SF'], name='city'),
                    columns=pd.Index(['A', 'B', 'C','D'], name='letter'))

In [None]:
dframe1

In [None]:
# Use stack to pivot the columns into the rows
dframe_st = dframe1.stack()
dframe_st

In [None]:
#rearrange back
dframe_st.unstack()

In [None]:
import pandas as pd # importing pandas = > useful for creating dataframes

In [None]:
x1 = [1,2,3,4,4] # list format
x2 = [10,11,12,13,14] # list format

In [None]:
x3 = list(range(5))

In [None]:
x1, x2, x3

In [None]:
# Creating a data frame from explicit lists
X = pd.DataFrame(columns = ['X1','X2','X3'])
X

In [None]:
X['X1'] = x1 # Converting list format into pandas series format
X['X2'] = x2 # Converting list format into pandas series format
X['X3'] = x3

In [None]:
X

In [None]:
X['X1'] = pd.Series(x1) # Converting list format into pandas series format
X['X2'] = pd.Series(x2) # Converting list format into pandas series format
X['X3'] = pd.Series(x3)

In [None]:
X

In [None]:
# Creating a data frame from explicit lists, with a custom index
X_new = pd.DataFrame(columns= ['X1','X2','X3'],index = [101,102,103,104,105])
X_new

In [None]:
X_new['X1'] = x1
X_new['X2'] = x2
X_new['X3'] = x3

In [None]:
X_new

In [None]:
# accessing columns using "." (dot) operation

In [None]:
X.X1

In [None]:
# accessing columns alternative way

In [None]:
X["X1"]

In [None]:
# Accessing multiple columns : giving column names as input in list format
X[["X1","X2"]]

In [None]:
# Accessing elements using ".iloc" : accessing each cell by row and column 
# index values
X.iloc[0:3,1]

In [None]:
X.iloc[:,:] # to get entire data frame

In [None]:
X.loc[0:2,["X1","X2"]]

In [None]:
# Statistics
X

In [None]:
X['X3'].mean()

In [None]:
X['X3'].median()

In [None]:
X['X3'].mode()

In [None]:
X.describe()

In [None]:
# Merge operation using pandas 

In [None]:
df1 = pd.DataFrame({"X1":[1,2,3],"X2":[4,8,12],})
df2 = pd.DataFrame({"X1":[1,2,3,4],"X3":[14,18,112,15],})
df1,df2

In [None]:
merge = pd.merge(df1,df2, on = "X1") # merge function
merge

In [None]:
# Replace index name
df = pd.DataFrame({"X1":[1,2,3],"X2":[4,8,12]})
df

In [None]:
df.set_index("X1", inplace = True) # Assigning index names using column names

In [None]:
df

In [None]:
# Change the column names
df = pd.DataFrame({"X1":[1,2,3],"X2":[4,8,12],})

In [None]:
df = df.rename(columns = {"X2":"X4"}) #Change column names

In [None]:
print(df)

In [None]:
# Concatenation

In [None]:
df1 = pd.DataFrame({"X1":[1,2,3],"X2":[4,8,12],},index = [2000,2001,2002]) # the index should be an ordered list, not a set
df2 = pd.DataFrame({"X1":[4,5,6],"X2":[14,16,18],},index = [2003,2004,2005])

In [None]:
Concatenate = pd.concat([df1,df2])

In [None]:
print(Concatenate)

In [None]:
x1 = [1, 2, 3, 4,5,np.nan] 
x2 = [np.nan, 11, 12,100,np.nan,200] 

In [None]:
df=pd.DataFrame()

In [None]:
df['x1']=x1
df['x2']=x2

In [None]:
df

In [None]:
#finding null values

In [None]:
df.isna().sum()

In [None]:
df.dropna()

In [None]:
# another way to create dataframe

In [None]:
df = pd.DataFrame(
    {"a" : [4,5,6],
     "b" : [7,8,9],           ## Dictionary Key value pairs                                                          
     "c" : [10,11,12]},
    index = [1,2,3])

In [None]:
df

In [None]:
# another way to create dataframe
df = pd.DataFrame(
     [[4,7,10],
     [5,8,11],
     [6,9,12]],
     index = [1,2,3],
    columns = ['a','b','c'])

In [None]:
df

In [None]:
a = pd.Series([50,40,34,30,22,28,17,19,20,13,9,15,10,7,3])

In [None]:
len(a)

In [None]:
a.plot()

In [None]:
a.plot(figsize =(8,6),
       color = 'green',title = 'line plot',fontsize = 12)

In [None]:
b = pd.Series([45,22,12,9,20,34,28,19,26,38,41,24,14,32])
len(b)

In [None]:
c = pd.Series([25,38,33,38,23,12,30,37,34,22,16,24,12,9])
len(c)

In [None]:
d = pd.DataFrame({'a':a,'b':b,'c':c})
d

In [None]:
d.plot.area(figsize = (9,6),title = 'Area plot')
d.plot.area(alpha= 0.4, color = ['coral','purple','lightgreen'],figsize = (8,6),fontsize = 12)

In [None]:
# Reading an external file
import pandas as pd
help(pd.read_csv)
# Import data (.csv file) using pandas. We are using mba data set
mba = pd.read_csv("D:/mba.csv")

In [None]:
mba

In [None]:
type(mba) # pandas data frame

In [None]:
mba.groupby('gmat').count()

In [None]:
mba.groupby('gmat').count()['datasrno']

In [None]:
list(mba.groupby('gmat'))

In [None]:
mba.groupby('gmat').sum().sort_values(by='workex')

In [None]:
mba.groupby('gmat').sum().sort_values(by='workex',ascending=False)

In [26]:
import pandas as pd

In [27]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
d

{'one': a    1
 b    2
 c    3
 dtype: int64,
 'two': a    1
 b    2
 c    3
 d    4
 dtype: int64}

In [28]:
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


In [29]:
df['three']=pd.Series([10,20,30],index=['a','b','c'])
df

Unnamed: 0,one,two,three
a,1.0,1,10.0
b,2.0,2,20.0
c,3.0,3,30.0
d,,4,


In [30]:
df['four']=df['one']+df['three']
df

Unnamed: 0,one,two,three,four
a,1.0,1,10.0,11.0
b,2.0,2,20.0,22.0
c,3.0,3,30.0,33.0
d,,4,,


### Column Deletion

In [31]:
import pandas as pd

In [32]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
     'three' : pd.Series([10,20,30], index=['a','b','c'])}

In [33]:
df = pd.DataFrame(d)
df

Unnamed: 0,one,two,three
a,1.0,1,10.0
b,2.0,2,20.0
c,3.0,3,30.0
d,,4,


#### using the del statement

In [34]:
del df['one']
df

Unnamed: 0,two,three
a,1,10.0
b,2,20.0
c,3,30.0
d,4,


#### using the pop method

In [35]:
df.pop('two')
df

Unnamed: 0,three
a,10.0
b,20.0
c,30.0
d,
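Unlike `del`, `pop` hands the removed column back, and `drop` is a third option that leaves the original frame untouched. A minimal sketch with a made-up frame (`frame` is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame({"two": [1, 2, 3], "three": [10.0, 20.0, 30.0]})

# pop removes the column in place and returns it as a Series.
removed = frame.pop("two")

# drop returns a new frame and leaves `frame` itself unchanged.
without_three = frame.drop(columns="three")

print(removed)
print(frame.columns.tolist())
print(without_three.shape)
```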


In [36]:
df

Unnamed: 0,three
a,10.0
b,20.0
c,30.0
d,


### Row Selection, Addition, and Deletion

In [37]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


### loc - Selection by Label

In [38]:
df.loc['b']

one    2.0
two    2.0
Name: b, dtype: float64

### iloc - Selection by integer location

In [39]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df

df.iloc[2]

one    3.0
two    3.0
Name: c, dtype: float64

### Slice Rows

In [42]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


In [43]:
df[2:4]

Unnamed: 0,one,two
c,3.0,3
d,,4


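### Addition of Rows - Using concat (formerly append)

In [44]:
import pandas as pd

df = pd.DataFrame([[1,2],[3,4]], columns = ['a','b'])
df2 = pd.DataFrame([[5,6],[7,8]], columns = ['a','b'])

df = pd.concat([df, df2]) # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result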
df

Unnamed: 0,a,b
0,1,2
1,3,4
0,5,6
1,7,8


### Deletion of Rows - By using drop

In [45]:
import pandas as pd

df = pd.DataFrame([[1,2],[3,4]], columns = ['a','b'])
df2 = pd.DataFrame([[5,6],[7,8]], columns = ['a','b'])

df = pd.concat([df, df2]) # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df

Unnamed: 0,a,b
0,1,2
1,3,4
0,5,6
1,7,8


In [46]:
df = df.drop(0) # drops every row labeled 0 (there are two of them here)
df

Unnamed: 0,a,b
1,3,4
1,7,8
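Two rows disappear above because both carried the label 0. If only one row should go, one option is to renumber the rows while combining. The sketch below uses illustrative names (`top`, `bottom`) and `pd.concat` with `ignore_index=True`.

```python
import pandas as pd

top = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
bottom = pd.DataFrame([[5, 6], [7, 8]], columns=["a", "b"])

# Plain concatenation keeps each frame's labels, so 0 and 1 appear twice.
stacked = pd.concat([top, bottom])
dropped_both = stacked.drop(0)        # removes both rows labeled 0

# ignore_index=True renumbers the rows 0..3, making each label unique.
renumbered = pd.concat([top, bottom], ignore_index=True)
dropped_one = renumbered.drop(0)      # removes only the first row

print(dropped_both)
print(dropped_one)
```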


### Indexing and Selecting Data

In [47]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8,4), index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

df

Unnamed: 0,A,B,C,D
a,0.508028,-1.827408,2.114543,-0.792586
b,1.605502,1.092372,0.672882,0.309479
c,0.556918,1.699633,0.178895,0.438679
d,-0.956933,2.986986,-0.057115,-0.719387
e,-0.474162,-0.539696,-0.828863,1.010799
f,-1.061397,0.297587,-0.294366,1.094477
g,-1.92271,0.543365,0.317931,0.554173
h,-1.854318,-0.206082,0.084855,-0.36432


#### select all rows for a specific column

In [48]:
df.loc[:,'A']

a    0.508028
b    1.605502
c    0.556918
d   -0.956933
e   -0.474162
f   -1.061397
g   -1.922710
h   -1.854318
Name: A, dtype: float64

### Select all rows for multiple columns, given as a list

In [49]:
df.loc[:,['A','C']]

Unnamed: 0,A,C
a,0.508028,2.114543
b,1.605502,0.672882
c,0.556918,0.178895
d,-0.956933,-0.057115
e,-0.474162,-0.828863
f,-1.061397,-0.294366
g,-1.92271,0.317931
h,-1.854318,0.084855


#### Select a few rows for multiple columns, both given as lists

In [50]:
df.loc[['a','b','f','h'],['A','C']]

Unnamed: 0,A,C
a,0.508028,2.114543
b,1.605502,0.672882
f,-1.061397,-0.294366
h,-1.854318,0.084855


### Select range of rows for all columns

In [51]:
df.loc['a':'h']

Unnamed: 0,A,B,C,D
a,0.508028,-1.827408,2.114543,-0.792586
b,1.605502,1.092372,0.672882,0.309479
c,0.556918,1.699633,0.178895,0.438679
d,-0.956933,2.986986,-0.057115,-0.719387
e,-0.474162,-0.539696,-0.828863,1.010799
f,-1.061397,0.297587,-0.294366,1.094477
g,-1.92271,0.543365,0.317931,0.554173
h,-1.854318,-0.206082,0.084855,-0.36432


### Getting a Boolean array from a comparison

In [52]:
df.loc['a']>0

A     True
B    False
C     True
D    False
Name: a, dtype: bool
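A Boolean result like the one above is usually passed back into `.loc` as a mask. The following sketch uses a small frame with fixed, made-up values so that the outcome is reproducible.

```python
import pandas as pd

# Hypothetical fixed values so the result is reproducible.
frame = pd.DataFrame(
    {"A": [0.5, -1.0, 2.0], "B": [-0.3, 0.8, 1.1]},
    index=["a", "b", "c"],
)

row_mask = frame["A"] > 0               # one True/False per row
positive_a = frame.loc[row_mask]        # keep rows where A is positive

col_mask = frame.loc["a"] > 0           # one True/False per column
positive_cols = frame.loc[:, col_mask]  # keep columns where row 'a' is positive

print(positive_a)
print(positive_cols)
```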

### Select the first four rows by integer position

In [53]:
df.iloc[:4]

Unnamed: 0,A,B,C,D
a,0.508028,-1.827408,2.114543,-0.792586
b,1.605502,1.092372,0.672882,0.309479
c,0.556918,1.699633,0.178895,0.438679
d,-0.956933,2.986986,-0.057115,-0.719387


### Integer slicing

In [54]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

df.iloc[:4]

Unnamed: 0,A,B,C,D
0,-0.094078,-1.771841,-0.029324,-1.752773
1,-0.147576,-0.134054,-0.151958,-0.102152
2,-0.04308,-0.881538,0.316097,0.398926
3,-1.857715,-0.762406,1.685136,-0.460164


In [55]:
df.iloc[1:5, 2:4]

Unnamed: 0,C,D
1,-0.151958,-0.102152
2,0.316097,0.398926
3,1.685136,-0.460164
4,1.241606,-0.994502


### Slicing through list of values

In [56]:
df.iloc[[1, 3, 5], [1, 3]]

Unnamed: 0,B,D
1,-0.134054,-0.102152
3,-0.762406,-0.460164
5,-0.41085,-0.550811


In [57]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
1,-0.147576,-0.134054,-0.151958,-0.102152
2,-0.04308,-0.881538,0.316097,0.398926


In [58]:
df.iloc[:,1:3]

Unnamed: 0,B,C
0,-1.771841,-0.029324
1,-0.134054,-0.151958
2,-0.881538,0.316097
3,-0.762406,1.685136
4,0.276041,1.241606
5,-0.41085,1.601677
6,0.437419,-0.417053
7,-0.389206,1.881528


## Missing Data

In [59]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])

df

Unnamed: 0,one,two,three
a,0.104303,0.147476,-1.543703
c,0.889921,-1.187806,0.041372
e,-0.057101,0.454609,0.366942
f,0.704777,-0.113984,0.792178
h,0.55666,0.782026,-0.674066


In [60]:
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df

Unnamed: 0,one,two,three
a,0.104303,0.147476,-1.543703
b,,,
c,0.889921,-1.187806,0.041372
d,,,
e,-0.057101,0.454609,0.366942
f,0.704777,-0.113984,0.792178
g,,,
h,0.55666,0.782026,-0.674066


### Checking the missing values for a particular column

In [61]:
df['one'].isnull()

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

### Replace NaN with a Scalar Value

In [62]:
df.fillna(0)

Unnamed: 0,one,two,three
a,0.104303,0.147476,-1.543703
b,0.0,0.0,0.0
c,0.889921,-1.187806,0.041372
d,0.0,0.0,0.0
e,-0.057101,0.454609,0.366942
f,0.704777,-0.113984,0.792178
g,0.0,0.0,0.0
h,0.55666,0.782026,-0.674066


### Fill NA with Forward fill - ffill

In [63]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

df.ffill() # fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent

Unnamed: 0,one,two,three
a,-1.296365,0.638175,0.41433
b,-1.296365,0.638175,0.41433
c,1.307971,-0.216738,0.769356
d,1.307971,-0.216738,0.769356
e,0.581478,-2.512944,-0.795641
f,1.37245,-0.204507,0.290172
g,1.37245,-0.204507,0.290172
h,-0.101236,-0.704249,-0.519049


### Fill NA with Backward fill - bfill

In [64]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

df.bfill()  # fillna(method='bfill') is deprecated in recent pandas versions

Unnamed: 0,one,two,three
a,-0.434231,0.290103,1.599962
b,0.359877,-0.682333,0.05227
c,0.359877,-0.682333,0.05227
d,2.369227,0.128283,-1.146745
e,2.369227,0.128283,-1.146745
f,-0.172592,1.32219,-2.175871
g,0.680891,1.990259,1.053697
h,0.680891,1.990259,1.053697


## Drop Missing Values

In [65]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

df.dropna()

Unnamed: 0,one,two,three
a,1.145902,1.581061,0.515559
c,0.981796,-0.984258,0.132546
e,-0.345766,-1.092726,0.063722
f,2.409541,-1.399452,1.260523
h,-0.172168,-1.554851,-1.011562
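`dropna()` with no arguments removes every row that contains at least one NaN. The `how` and `subset` parameters give finer control. A short sketch on an assumed three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 'b' is partly missing, row 'c' is entirely missing
demo = pd.DataFrame(
    {"one": [1.0, np.nan, np.nan], "two": [2.0, 5.0, np.nan], "three": [3.0, 6.0, np.nan]},
    index=["a", "b", "c"],
)

any_missing = demo.dropna()                # default how='any': keeps only 'a'
all_missing = demo.dropna(how="all")       # drops rows that are entirely NaN: keeps 'a' and 'b'
by_column = demo.dropna(subset=["two"])    # only checks column 'two': keeps 'a' and 'b'

print(any_missing, all_missing, by_column, sep="\n\n")
```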


### Replace Missing (or) Generic Values

In [66]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})

df.replace({1000:10,2000:60})


Unnamed: 0,one,two
0,10,10
1,20,0
2,30,30
3,40,40
4,50,50
5,60,60
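A flat dictionary such as `{1000: 10, 2000: 60}` is applied to every column. To limit a replacement to one column, `replace` also accepts a nested dictionary keyed by column name. A small sketch with assumed values:

```python
import pandas as pd

# Hypothetical frame in which 2000 appears in both columns
df = pd.DataFrame({"one": [10, 20, 2000], "two": [1000, 2000, 30]})

everywhere = df.replace({2000: 60})             # replaces 2000 in every column
only_one = df.replace({"one": {2000: 60}})      # replaces 2000 in column 'one' only

print(everywhere)
print(only_one)
```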


## groupby

In [67]:
sales=pd.DataFrame({'weekday':['sun','sun','mon','mon'],
                   'city':['austin','dallas','austin','dallas'],
                   'bread':[139,237,326,456],
                   'butter':[20,45,70,98]})

sales

Unnamed: 0,weekday,city,bread,butter
0,sun,austin,139,20
1,sun,dallas,237,45
2,mon,austin,326,70
3,mon,dallas,456,98


## Set_index

In [68]:
df2=sales.set_index(['city','weekday'])
df2

city,weekday,bread,butter
austin,sun,139,20
dallas,sun,237,45
austin,mon,326,70
dallas,mon,456,98


### Groupby and count

In [69]:
sales.groupby(['weekday']).count()

weekday,city,bread,butter
mon,2,2,2
sun,2,2,2


### Groupby and sum

In [70]:
sales.groupby(['weekday'])['bread'].sum()

weekday
mon    782
sun    376
Name: bread, dtype: int64

### Groupby and mean

In [71]:
sales.groupby(['city']).mean(numeric_only=True)  # numeric_only skips the text column 'weekday'

city,bread,butter
austin,232.5,45.0
dallas,346.5,71.5


### Groupby maximum

In [72]:
sales.groupby(['city'])[['bread','butter']].max()  # select several columns with a list

city,bread,butter
austin,326,70
dallas,456,98


### Aggregation

In [73]:
sales.groupby('city')[['bread','butter']].agg(['max','sum'])  # select several columns with a list

,bread,bread,butter,butter
city,max,sum,max,sum
austin,326,465,70,90
dallas,456,693,98,143
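Passing a list of function names to `agg` produces two-level column headers. When flat, self-describing column names are preferred, named aggregation is one option. The sketch below rebuilds the `sales` frame so it runs on its own; the output column names (`bread_max` and so on) are choices made for this example:

```python
import pandas as pd

sales = pd.DataFrame({
    "weekday": ["sun", "sun", "mon", "mon"],
    "city": ["austin", "dallas", "austin", "dallas"],
    "bread": [139, 237, 326, 456],
    "butter": [20, 45, 70, 98],
})

# Each keyword is an output column: (input column, aggregation function)
summary = sales.groupby("city").agg(
    bread_max=("bread", "max"),
    bread_sum=("bread", "sum"),
    butter_mean=("butter", "mean"),
)

print(summary)
```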


### Concatenation

In [74]:
# import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])


pd.concat([one,two])

Unnamed: 0,Name,subject_id,Marks_scored
1,Alex,sub1,98
2,Amy,sub2,90
3,Allen,sub4,87
4,Alice,sub6,69
5,Ayoung,sub5,78
1,Billy,sub2,89
2,Brian,sub4,80
3,Bran,sub3,79
4,Bryce,sub6,97
5,Betty,sub5,88
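The result above repeats the index labels 1 to 5, once per input frame. Two common ways to deal with that are `ignore_index=True`, which renumbers the rows, and `keys=`, which records which frame each row came from. A reduced sketch (two rows per frame, assumed for brevity):

```python
import pandas as pd

# Shortened, hypothetical versions of the frames above
one = pd.DataFrame({"Name": ["Alex", "Amy"], "Marks_scored": [98, 90]}, index=[1, 2])
two = pd.DataFrame({"Name": ["Billy", "Brian"], "Marks_scored": [89, 80]}, index=[1, 2])

renumbered = pd.concat([one, two], ignore_index=True)     # fresh index 0..3
labelled = pd.concat([one, two], keys=["one", "two"])     # outer index level names the source

print(renumbered)
print(labelled)
```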


### Concatenation BY COLUMNS

In [75]:
pd.concat([one,two],axis=1)

Unnamed: 0,Name,subject_id,Marks_scored,Name,subject_id,Marks_scored
1,Alex,sub1,98,Billy,sub2,89
2,Amy,sub2,90,Brian,sub4,80
3,Allen,sub4,87,Bran,sub3,79
4,Alice,sub6,69,Bryce,sub6,97
5,Ayoung,sub5,78,Betty,sub5,88


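### Concatenating Using append (removed in pandas 2.0)

In [76]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([one, two])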

Unnamed: 0,Name,subject_id,Marks_scored
1,Alex,sub1,98
2,Amy,sub2,90
3,Allen,sub4,87
4,Alice,sub6,69
5,Ayoung,sub5,78
1,Billy,sub2,89
2,Brian,sub4,80
3,Bran,sub3,79
4,Bryce,sub6,97
5,Betty,sub5,88


### TIME SERIES

Pandas provides robust tools for working with time series data, which is especially common in the financial sector. While working with time series data, we frequently come across the following tasks:

1) Generating sequence of time

2) Convert the time series to different frequencies
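Both tasks can be sketched in a few lines. The dates and values below are made up for illustration; `pd.date_range` generates the sequence and `resample` converts it to a lower frequency:

```python
import pandas as pd

# 1) Generate a sequence of timestamps: six consecutive days (hypothetical dates)
days = pd.date_range(start="2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=days)

# 2) Convert to a different frequency: total for every two-day window
two_day_totals = ts.resample("2D").sum()

print(ts)
print(two_day_totals)
```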

# NumPy Library

    NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python.

    Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.

# What is NumPy?

    1) NumPy is the core library for scientific computing in Python.

    2) It provides a high-performance multidimensional array object, and tools for working with these arrays.

    3) NumPy's main object is the multidimensional array.

    4) It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.

    5) In NumPy, dimensions are called axes.

# What is Multi-dimensional array?

    A[0,0], A[0,1], A[0,2]   ------------------>          1     2     3

    A[1,0], A[1,1], A[1,2]   ------------------>          5     6     7 

    A[2,0], A[2,1], A[2,2]   ------------------>          8     9     10

    A[3,0], A[3,1], A[3,2]   ------------------>          11    12    13
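The layout above can be reproduced directly. This sketch builds the same 4-by-3 array and reads a few elements by their `[row, column]` position:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [5, 6, 7],
              [8, 9, 10],
              [11, 12, 13]])

print(A.shape)   # four rows, three columns
print(A[0, 0])   # first row, first column
print(A[3, 2])   # last row, last column
print(A[1])      # the whole second row
```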



# How to Install Numpy in Anaconda

    - conda install numpy

    - pip install numpy

# How to import NumPy in a program

Open Jupyter Notebook

    1) import numpy

# How to import NumPy under a shorter name

    import numpy as np

    Here np is an alias (a shorter name) for the numpy module.

    Can I use another name instead of np? Yes, any valid name works, but np is the widely used convention.

# NumPy vs Lists

    Why should I use a NumPy array when I already have lists?

    Advantages of NumPy arrays over lists

        1) Less Memory

        2) Fast

        3) Convenient
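The memory and convenience claims can be checked with a rough measurement. The sketch below assumes CPython, where a list stores pointers to separate integer objects; exact list sizes vary by Python version, so treat the numbers as indicative only:

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list holds pointers to separate int objects, so count both parts
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)
# An array keeps the raw values in one contiguous buffer
array_bytes = np_array.nbytes

print(list_bytes, array_bytes)

# Convenience: arithmetic applies to every element without an explicit loop
doubled = np_array * 2
```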

# Advantages of NumPy

NumPy is a fundamental package for scientific computing in Python.

It provides an n-dimensional array object.

It has tools for integrating C and C++ code, and it offers linear algebra, Fourier transform, and random number capabilities.

# Data types

| Data type | Description |
| --- | --- |
| bool_ | Boolean (True or False) stored as a byte |
| int_ | Default integer type (same as C long; normally either int64 or int32) |
| intc | Identical to C int (normally int32 or int64) |
| intp | Integer used for indexing (same as C ssize_t; normally either int32 or int64) |
| int8 | Byte (-128 to 127) |
| int16 | Integer (-32768 to 32767) |
| int32 | Integer (-2147483648 to 2147483647) |
| int64 | Integer (-9223372036854775808 to 9223372036854775807) |
| uint8 | Unsigned integer (0 to 255) |
| uint16 | Unsigned integer (0 to 65535) |
| uint32 | Unsigned integer (0 to 4294967295) |
| uint64 | Unsigned integer (0 to 18446744073709551615) |
| float_ | Shorthand for float64 (removed in NumPy 2.0; use float64) |
| float16 | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa |
| float32 | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa |
| float64 | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa |
| complex_ | Shorthand for complex128 (removed in NumPy 2.0; use complex128) |
| complex64 | Complex number, represented by two 32-bit floats |
| complex128 | Complex number, represented by two 64-bit floats |
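The integer ranges listed above do not need to be memorized, since `np.iinfo` reports them for any integer type. A short sketch that also shows how the dtype determines bytes per element and how `astype` converts between types:

```python
import numpy as np

# Print the range and size of a few integer types
for dt in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(dt)
    print(np.dtype(dt).name, info.min, info.max, np.dtype(dt).itemsize, "byte(s)")

small = np.array([1, 2, 3], dtype=np.int8)   # one byte per element
as_float = small.astype(np.float64)          # converted copy, eight bytes per element

print(small.itemsize, as_float.itemsize)
```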

# Which functions and methods can we apply to arrays?

all()

any()

take()

put()

apply_along_axis()

apply_over_axes()

argmin()

argmax()

nanargmin()

nanargmax()

amax()

amin()

insert()

delete()

append()

around()

flip()

fliplr()

flipud()

triu()

tril()

tri()

empty()

empty_like()

zeros()

zeros_like()

ones()

ones_like()

full_like()

diag()

diagflat()

diag_indices()

asmatrix()

bmat()

eye()

roll()

identity()

arange()

place()

extract()

compress()

rot90()

tile()

reshape()

ravel()

isinf()

isrealobj()

isscalar()

isneginf()

isposinf()

iscomplex()

isnan()

iscomplexobj()

isreal()

isfinite()

isfortran()

exp()

exp2()

fix()

hypot()

absolute()

ceil()

floor()

degrees()

radians()

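npv() (moved to the numpy-financial package as of NumPy 1.20)

fv() (moved to numpy-financial)

pv() (moved to numpy-financial)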

power()

float_power()

log()

log1p()

log2()

log10()

dot()

vdot()

trunc()

divide()

floor_divide()

true_divide()

random.rand()

random.randn()

ndarray.flat (an attribute, not a method)

expm1()

bincount()

rint()

equal()

not_equal()

less()

less_equal()

greater()

greater_equal()

prod()

square()

cbrt()

logical_or()

logical_and()

logical_not()

logical_xor()

array_equal()

array_equiv()

sin()

cos()

tan()

sinh()

cosh()

tanh()

arcsin()

arccos()

arctan()

arctan2()
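A handful of the functions listed above, applied to one small array, give a feel for how they are called. The array `a` is an arbitrary example:

```python
import numpy as np

a = np.array([[3, 7, 1],
              [9, 2, 8]])

print(np.argmax(a))                    # position of the largest value in the flattened array
print(np.argmax(a, axis=1))            # position of the largest value in each row
print(np.any(a > 8), np.all(a > 0))    # is any value > 8? are all values > 0?
print(np.ravel(a))                     # flattened to one dimension
print(np.fliplr(a))                    # columns in reverse order
print(np.tile([1, 2], 3))              # the pattern [1, 2] repeated three times
```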

## NumPy

### NumPy provides:

1. An extension package to Python for multi-dimensional arrays
2. Closer to hardware (efficiency)
3. Designed for scientific computation (convenience)
4. Also known as array-oriented computing

--------------------------------------------------------------------
- What is an array
- Difference between arrays and lists
- What is NumPy
- Types of NumPy arrays
- What are "vectors" and "matrices"
- NumPy's importance and benefits over Python lists
- What is its use in machine learning

--------------------------------------------------------------------
- Create an array, check dimensions, shape and length
- Creating NumPy arrays from a list

### Create a numpy array

In [1]:
import numpy as np

a = np.array([0,1,2,3])
print(a)

[0 1 2 3]


In [2]:
print(np.arange(10))

[0 1 2 3 4 5 6 7 8 9]


In [5]:
type(a)

numpy.ndarray

### Why it is useful in ML:

Memory-efficient container that provides fast numerical operations.

In [6]:
# python lists
L = range(1000)
%timeit [i**2 for i in L]

240 µs ± 6.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [7]:
a = np.arange(1000)
%timeit a**2

The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
13.4 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
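`%timeit` is an IPython magic, so it only works inside a notebook or IPython shell. A rough equivalent for a plain Python script can be sketched with the standard-library `timeit` module. Timings depend on the machine, so no particular numbers should be expected:

```python
import timeit
import numpy as np

values = range(1000)
arr = np.arange(1000)

# Both approaches compute the same squares
list_result = [i ** 2 for i in values]
array_result = arr ** 2

# Total time for 200 repetitions of each approach
list_time = timeit.timeit(lambda: [i ** 2 for i in values], number=200)
array_time = timeit.timeit(lambda: arr ** 2, number=200)

print(f"list comprehension: {list_time:.4f}s, NumPy: {array_time:.4f}s")
```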


## 1. Creating arrays (1D, 2D, 3D)

In [8]:
# 1-D
a = np.array([0,1,2,3])
a

array([0, 1, 2, 3])

In [9]:
# print dimensions
a.ndim

1

In [10]:
#shape
a.shape

(4,)

In [12]:
len(a)

4

In [13]:
# 2-D

b = np.array([[0,1,2],[3,4,5]])
b

array([[0, 1, 2],
       [3, 4, 5]])

In [14]:
b.ndim

2

In [15]:
b.shape

(2, 3)

In [18]:
len(b) # returns the size of the first dimension

2

In [19]:
# 3d array
c = np.array([[[0,1],[2,3]],[[4,5],[6,7]]])
c

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

In [20]:
c.ndim

3

In [21]:
c.shape

(2, 2, 2)

### 2. Creating NumPy Arrays From a List

In [22]:
# We know how to create a list, right
my_list = [1,2,3]
my_list

[1, 2, 3]

In [23]:
np.array(my_list)

array([1, 2, 3])

In [24]:
# A matrix written as a nested list (a list of rows) can be converted to a 2-D array
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
my_matrix

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [25]:
np.array(my_matrix)

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

### 3. How to Generate Array

Just as dragging a cell in Excel fills in a series of values, NumPy has built-in functions that generate arrays for you.

### arange

Returns evenly spaced values within a given interval. The start is included and the stop is excluded.

In [26]:
np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [27]:
np.arange(0,11,2) # step of 2

array([ 0,  2,  4,  6,  8, 10])

### zeros and ones

Generate arrays of zeros or ones

In [28]:
np.zeros(3)

array([0., 0., 0.])

In [29]:
# 5-by-5 array of zeros
np.zeros((5,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [30]:
np.ones(3)

array([1., 1., 1.])

In [31]:
# 3-by-3 array of ones
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

## Linspace

Return evenly spaced numbers over a specified interval.

In [32]:
# I want to get 3 evenly spaced points from 0 to 10 (both ends included)
np.linspace(0,10,3)

array([ 0.,  5., 10.])

In [33]:
np.linspace(0,10,50)

array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
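`linspace` and `arange` are easy to confuse. With `linspace` you choose how many points you want and the stop value is included by default, while with `arange` you choose the step and the stop value is excluded. A small comparison:

```python
import numpy as np

# 5 points from 0 to 10, endpoint included; retstep also returns the spacing
points, step = np.linspace(0, 10, 5, retstep=True)

# Same interval, but with the endpoint left out
open_ended = np.linspace(0, 10, 5, endpoint=False)

# Fixed step of 2.5; the stop value 10 is excluded
by_step = np.arange(0, 10, 2.5)

print(points, step)
print(open_ended)
print(by_step)
```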

## eye

Creates an identity matrix

- An identity matrix, in case you are not familiar with it, is used in linear algebra. It is a two-dimensional
square matrix (the number of rows equals the number of columns) with ones on the main diagonal and
zeros everywhere else.

In [34]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
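The defining property of an identity matrix is that multiplying by it leaves a matrix unchanged. A quick check, using an arbitrary 2-by-2 matrix `M` as the example:

```python
import numpy as np

M = np.array([[2.0, 3.0],
              [4.0, 5.0]])
I = np.eye(2)

print(M @ I)             # matrix product: same values as M
print(np.eye(3, k=1))    # k=1 places the ones on the diagonal above the main one
```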

## Random

NumPy also has lots of ways to create arrays of random numbers:

## rand

Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).

In [35]:
np.random.rand(2)

array([0.52476899, 0.92390428])

In [36]:
np.random.rand(5,5)

array([[0.11530202, 0.26484606, 0.51699122, 0.81083992, 0.58242304],
       [0.86269471, 0.76614248, 0.77196954, 0.90253417, 0.71861243],
       [0.35452928, 0.32679249, 0.4124277 , 0.08417838, 0.62820539],
       [0.11347664, 0.07936212, 0.95049623, 0.73775567, 0.26980661],
       [0.56331268, 0.84873672, 0.07456801, 0.81254321, 0.92501996]])

## randn

Return a sample (or samples) from the "standard normal" distribution (mean 0, standard deviation 1), unlike rand, which is uniform:

In [37]:
np.random.randn(2)

array([-0.49531639,  0.81492063])

In [38]:
np.random.randn(5,5)

array([[-1.31240864, -2.88117677, -0.40069952, -0.70304256, -1.47074564],
       [-0.64668158,  0.04423155,  0.35509169,  1.61069741, -1.19660245],
       [-0.69662668,  2.38401296,  1.30517393, -0.73735609, -0.1917702 ],
       [ 0.70548493, -0.08201796, -1.64998473,  0.17852204, -0.06199639],
       [ 0.44039446, -1.25221458, -0.34020824, -0.46896511,  1.44829647]])

## randint

Return random integers from low (inclusive) to high (exclusive).

In [40]:
np.random.randint(1,100)

38

In [41]:
np.random.randint(1,100,10)

array([18, 81, 22, 88, 77, 31, 53, 59, 97, 94])
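The random outputs above change on every run. When repeatable results are needed, for example in tests, a seeded generator is one option. The sketch below uses `np.random.default_rng`, NumPy's newer generator interface, rather than the `np.random.rand` family shown above; the seed value 42 is arbitrary:

```python
import numpy as np

rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

uniform = rng_a.random(3)                  # like rand: uniform on [0, 1)
normal = rng_a.standard_normal(3)          # like randn: standard normal
ints = rng_a.integers(1, 100, size=10)     # like randint: low inclusive, high exclusive

# A second generator with the same seed starts with the same sequence
repeat = rng_b.random(3)

print(uniform, repeat)
```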

## max, min
These are useful methods for finding max or min values.

In [42]:
arr = np.arange(25)

In [43]:
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

In [44]:
arr.max()

24

In [45]:
arr.min()

0

## Shape
Shape is an attribute that arrays have (not a method):

In [46]:
# Vector
arr.shape

(25,)

In [47]:
# Notice the two sets of brackets
arr.reshape(1,25)

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24]])

In [48]:
arr.reshape(1,25).shape

(1, 25)
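`reshape` only works when the new shape holds exactly the same number of elements. One dimension can be given as `-1`, in which case NumPy works it out from the others. A sketch using a 24-element array (chosen here because 24 has many divisors):

```python
import numpy as np

arr = np.arange(24)

grid = arr.reshape(4, 6)       # 4 rows, 6 columns
auto = arr.reshape(-1, 8)      # -1 is inferred: 24 / 8 = 3 rows
column = arr.reshape(-1, 1)    # a single column with 24 rows

try:
    arr.reshape(5, 5)          # 25 slots cannot hold 24 values
except ValueError as err:
    print("reshape failed:", err)
```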

## dtype
You can also grab the data type of the object in the array:

In [49]:
arr.dtype

dtype('int32')

## Calculations on Arrays
First create a 2-D array from a nested list (note the double brackets), then perform calculations on it.

In [50]:
arr1 = np.array([[1,2,3],[8,9,10]])

In [51]:
# 1. Adding arrays
arr1+arr1

array([[ 2,  4,  6],
       [16, 18, 20]])

In [52]:
# 2. Multiplying arrays
arr1*arr1

array([[  1,   4,   9],
       [ 64,  81, 100]])

In [53]:
# 3. Subtracting arrays
arr1-arr1

array([[0, 0, 0],
       [0, 0, 0]])

In [54]:
# 4. Floor-dividing arrays (// returns integers here; use / for true division, which returns floats)
arr1//arr1

array([[1, 1, 1],
       [1, 1, 1]], dtype=int32)
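Dividing an array by itself hides the difference between the division operators, because every result is 1. With a different divisor (the array `divisor` is an assumed example), `/`, `//` and `%` can be told apart:

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [8, 9, 10]])
divisor = np.array([[2, 2, 2], [4, 4, 4]])

true_div = arr1 / divisor      # true division: always returns floats
floor_div = arr1 // divisor    # floor division: rounds down, stays integer
remainder = arr1 % divisor     # element-wise remainder

print(true_div)
print(floor_div)
print(remainder)
```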

## Indexing Arrays

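Arrays are ordered sequences. A slice is a view of the original array, so assigning through a slice modifies the original in place; use copy() to get an independent array.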

In [55]:
arr = np.arange(11)

In [56]:
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [57]:
slice_of_arr = arr[0:6]
slice_of_arr

array([0, 1, 2, 3, 4, 5])

In [58]:
arr_copy = arr.copy()
arr_copy

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
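The difference between a slice and a copy only becomes visible once one of them is modified. A minimal sketch:

```python
import numpy as np

arr = np.arange(11)

view = arr[0:6]             # a slice is a view: it shares memory with arr
view[:] = 99                # so this also changes the first six elements of arr

independent = arr.copy()    # copy() allocates new memory
independent[:] = -1         # arr is not affected

print(arr)
print(np.shares_memory(view, arr), np.shares_memory(independent, arr))
```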

## Format to slice
arr_2d [row][col] or arr_2d[row,col]

In [59]:
arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
arr_2d

array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])

In [60]:
# To just grab a row
arr_2d[1]

array([20, 25, 30])

In [61]:
# To grab an individual element
arr_2d[1][0]

20

In [62]:
# or as mentioned above there are two formats
arr_2d[1,0]

20
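The `arr_2d[row, col]` form also accepts slices in either position, which makes it possible to pull out whole sub-arrays. A sketch on the same 3-by-3 array:

```python
import numpy as np

arr_2d = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

top_right = arr_2d[:2, 1:]    # rows 0 and 1, columns 1 and 2
middle_col = arr_2d[:, 1]     # every row, column 1
last_row = arr_2d[-1]         # negative indices count from the end

print(top_right)
print(middle_col)
print(last_row)
```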

In [63]:
import numpy as np

# A list of elements in variable 'x'

x = [1,2,3,4,5]

# Multiplying a list by 2 does not double the values
x*2 # repeats the list twice instead

# A NumPy array applies the multiplication to each element


[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

In [64]:
y = np.array(x)

In [65]:
y

array([1, 2, 3, 4, 5])

In [66]:
type(y)

numpy.ndarray

In [67]:
y*2

array([ 2,  4,  6,  8, 10])

In [68]:
y>2

array([False, False,  True,  True,  True])

In [69]:
y[y>2]

array([3, 4, 5])
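Boolean masks can be combined. NumPy uses `&` and `|` for element-wise AND and OR, and each condition needs its own parentheses because these operators bind more tightly than comparisons. `np.where` is a related tool that picks between two values per element:

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5])

middle = y[(y > 1) & (y < 5)]     # greater than 1 AND less than 5
edges = y[(y == 1) | (y == 5)]    # equal to 1 OR equal to 5
capped = np.where(y > 3, 3, y)    # values above 3 become 3, the rest are kept

print(middle, edges, capped)
```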

In [70]:
# Quick reference: common array operations with a short description of each

In [71]:
np.array([1,2,3]) # 1d array

array([1, 2, 3])

In [72]:
np.array([(1,2,3),(4,5,6)]) # 2d array

array([[1, 2, 3],
       [4, 5, 6]])

#### np.arange(start,stop,step) # range array

In [75]:
np.linspace(0,2,9) # 9 evenly spaced values from 0 to 2, both ends included

array([0.  , 0.25, 0.5 , 0.75, 1.  , 1.25, 1.5 , 1.75, 2.  ])

In [76]:
np.zeros((1,2)) # create array filled with zeros

array([[0., 0.]])

In [77]:
np.ones((1,2)) # creates an array filled with ones

array([[1., 1.]])

In [78]:
np.random.random((5,5)) # creates random array

array([[0.85732053, 0.25921383, 0.95012824, 0.93425596, 0.5946853 ],
       [0.48462607, 0.36765787, 0.24796623, 0.50853921, 0.61796144],
       [0.37415529, 0.20880856, 0.70029546, 0.4072834 , 0.72436762],
       [0.29144997, 0.94041266, 0.17737077, 0.19458803, 0.91607491],
       [0.59028915, 0.02353667, 0.68726138, 0.77305946, 0.13006466]])

In [79]:
np.empty((2,2)) # creates an uninitialized array; its contents are arbitrary (whatever was in memory)

array([[1., 1.],
       [1., 1.]])

In [81]:
## array.shape # gives information on dimensions

In [83]:
## len(array) 

In [84]:
## array.ndim # number of  array dimension

In [85]:
## array.dtype # Data Type

In [86]:
# NumPy matrices
#Ex:1
a = np.matrix('1 2; 3 4')
type(a)
a

matrix([[1, 2],
        [3, 4]])

In [87]:
#Ex:2
b = np.matrix([[1, 2], [3, 4]])
b

matrix([[1, 2],
        [3, 4]])

In [88]:
b.shape

(2, 2)
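The NumPy documentation no longer recommends `np.matrix` for new code and suggests regular arrays instead. With plain arrays, `*` multiplies element by element and `@` performs matrix multiplication. A sketch with two assumed 2-by-2 arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

elementwise = a * b    # multiplies matching positions
matmul = a @ b         # matrix product (rows of a times columns of b)

print(elementwise)
print(matmul)
```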

In [89]:
# create a sequence of integers with specific values

f = np.arange(0,50,5)
f

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45])

In [90]:
#np.arange vs range
f=range(0,50,5)
f

range(0, 50, 5)

In [91]:
f=list(range(0,50,5))
f

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

In [92]:
#reshaping the numpy array
ary = np.array([[2,3,4,5],[6,8,4,7],[9,5,1,3]])
ary

array([[2, 3, 4, 5],
       [6, 8, 4, 7],
       [9, 5, 1, 3]])

In [93]:
ary.shape

(3, 4)

In [94]:
ary[1]

array([6, 8, 4, 7])

In [95]:
ary[0,3]

5

In [96]:
newary = ary.reshape(6,2)

In [97]:
newary.shape

(6, 2)

In [98]:
newary


array([[2, 3],
       [4, 5],
       [6, 8],
       [4, 7],
       [9, 5],
       [1, 3]])

In [99]:
#flatten
ar2 = newary.flatten()
ar2


array([2, 3, 4, 5, 6, 8, 4, 7, 9, 5, 1, 3])

In [101]:
#sort
ary = np.array([[2,3,4,5],[6,8,4,7],[9,5,1,3]])
ary

array([[2, 3, 4, 5],
       [6, 8, 4, 7],
       [9, 5, 1, 3]])

In [102]:
ary.sort()
ary

array([[2, 3, 4, 5],
       [4, 6, 7, 8],
       [1, 3, 5, 9]])
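`ary.sort()` sorts each row in place and overwrites the original. The function form `np.sort` returns a sorted copy instead, and its `axis` argument controls the direction of the sort:

```python
import numpy as np

ary = np.array([[2, 3, 4, 5], [6, 8, 4, 7], [9, 5, 1, 3]])

by_rows = np.sort(ary)             # default axis=-1: each row sorted, ary unchanged
by_cols = np.sort(ary, axis=0)     # each column sorted top to bottom
flat = np.sort(ary, axis=None)     # flattened first, then sorted

print(by_rows)
print(by_cols)
print(flat)
```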

In [103]:
#axis
ary = np.array([[2,3,4,5],[6,8,4,7],[9,5,1,3]])
ary

array([[2, 3, 4, 5],
       [6, 8, 4, 7],
       [9, 5, 1, 3]])

In [104]:
ary2=np.delete(ary,1,axis=1)
ary2

array([[2, 4, 5],
       [6, 4, 7],
       [9, 1, 3]])

In [105]:
ary = np.array([[2,3,4,5],[6,8,4,7],[9,5,1,3]])
ary

array([[2, 3, 4, 5],
       [6, 8, 4, 7],
       [9, 5, 1, 3]])

In [106]:
ary3=np.delete(ary,1,axis=0)
ary3

array([[2, 3, 4, 5],
       [9, 5, 1, 3]])
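In the two `np.delete` calls above, `axis=1` removed a column and `axis=0` removed a row. The same convention applies to reductions such as `sum`, where the named axis is the one that gets collapsed:

```python
import numpy as np

ary = np.array([[2, 3, 4, 5], [6, 8, 4, 7], [9, 5, 1, 3]])

col_totals = ary.sum(axis=0)                  # collapse the rows: one total per column
row_totals = ary.sum(axis=1)                  # collapse the columns: one total per row
two_removed = np.delete(ary, [0, 2], axis=1)  # a list of indices removes several columns

print(col_totals)
print(row_totals)
print(two_removed)
```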

In [107]:
dir(np)

['ALLOW_THREADS',
 'AxisError',
 'BUFSIZE',
 'CLIP',
 'DataSource',
 'ERR_CALL',
 'ERR_DEFAULT',
 'ERR_IGNORE',
 'ERR_LOG',
 'ERR_PRINT',
 'ERR_RAISE',
 'ERR_WARN',
 'FLOATING_POINT_SUPPORT',
 'FPE_DIVIDEBYZERO',
 'FPE_INVALID',
 'FPE_OVERFLOW',
 'FPE_UNDERFLOW',
 'False_',
 'Inf',
 'Infinity',
 'MAXDIMS',
 'MAY_SHARE_BOUNDS',
 'MAY_SHARE_EXACT',
 'MachAr',
 'NAN',
 'NINF',
 'NZERO',
 'NaN',
 'PINF',
 'PZERO',
 'RAISE',
 'SHIFT_DIVIDEBYZERO',
 'SHIFT_INVALID',
 'SHIFT_OVERFLOW',
 'SHIFT_UNDERFLOW',
 'ScalarType',
 'Tester',
 'TooHardError',
 'True_',
 'UFUNC_BUFSIZE_DEFAULT',
 'UFUNC_PYVALS_NAME',
 'WRAP',
 '_NoValue',
 '_UFUNC_API',
 '__NUMPY_SETUP__',
 '__all__',
 '__builtins__',
 '__cached__',
 '__config__',
 '__dir__',
 '__doc__',
 '__file__',
 '__getattr__',
 '__git_revision__',
 '__loader__',
 '__mkl_version__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_add_newdoc_ufunc',
 '_distributor_init',
 '_globals',
 '_mat',
 '_pytesttester',
 'abs',
 'absol