### AccelerateAI - Python for Data Science
##### Introduction to Python Language  (Python 3) 
In this notebook we will cover the following: 
* 1. NumPy       <br> 
* 2. Pandas      <br>
* 3. String & Text <br>

We will cover the following in Notebook 2:
* 5. Data Wrangling <br>
* 6. Advanced Numpy          <br>
***

#### 1. Numpy
- Numpy is short for <b>Num</b>erical <b>Py</b>thon
- It is a fundamental package required for high performance scientific computing and data analysis
- Many data analysis libraries are built on top of NumPy
- Here are some of the things it provides:
    - <b>ndarray</b>, a fast and space-efficient multidimensional array providing 
        - fast vectorized array operations for data munging and cleaning
        - common array algorithms like sorting, unique, and set operations 
        - efficient descriptive statistics and aggregating/summarizing data 
        - data alignment and relational data manipulations for merging and joining datasets
        - expressing conditional logic as array expressions instead of loops with if-elif-else branches
        - group-wise data manipulations 
    - tools for reading / writing array data to disk and working with memory-mapped files 
    - linear algebra, random number generation, and Fourier transform capabilities
    - tools for integrating code written in C, C++, and Fortran
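
As a small illustration of the "conditional logic as array expressions" point in the list above, the sketch below compares a Python loop with `np.where`. The `temps` array and the threshold of 28 are made-up values used only for this example.

```python
import numpy as np

# Hypothetical data: temperatures in degrees Celsius
temps = np.array([12.5, 30.1, 25.0, 35.7, 18.2])

# Loop version with if/else branches
labels_loop = []
for t in temps:
    if t > 28:
        labels_loop.append("hot")
    else:
        labels_loop.append("ok")

# Vectorized version: the condition is evaluated on the whole array at once
labels_vec = np.where(temps > 28, "hot", "ok")

print(labels_loop)
print(labels_vec)
```

Both versions produce the same labels; the vectorized form avoids the explicit Python loop.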

##### 1.1.1 Scalars in Numpy
- Python defines only one type of a particular data class (1 integer type, 1 floating-point type, etc.)
- NumPy adds 24 new fundamental Python types to describe different kinds of scalars, for example:
    - numpy.generic: base class for NumPy scalar types
    - numpy.ushort, numpy.uint, numpy.ulonglong etc.: unsigned integers
    - numpy.half, numpy.single, numpy.double, numpy.longdouble etc.: floating point
    - numpy.datetime64: dates and times
    - numpy.str_: strings

In [None]:
#importing numpy
import numpy as np

In [None]:
print(np.ScalarType)

In [None]:
x = np.int8(1)
y = np.float32(1.0)
x == y                                          #What do you expect the result to be?      

In [None]:
print(x.dtype , y.dtype)
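
The comparison above works because NumPy promotes the two scalar types to a common type first. The following sketch shows one way to inspect that promotion and the limited range of a fixed-width integer; it is only an illustration, not part of the original notebook.

```python
import numpy as np

# result_type reports the dtype NumPy would use when combining the two types
common = np.result_type(np.int8, np.float32)
print(common)

# Fixed-width integers have a limited range, unlike Python's built-in int
info = np.iinfo(np.int8)
print(info.min, info.max)
```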

In [None]:
#check current version of numpy
print("NumPy Version:", np.__version__)

##### 1.1.2 Numpy Constants - NumPy includes several constants
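 - numpy.inf    (positive infinity; the alias numpy.Inf was removed in NumPy 2.0)
 - numpy.nan    (lowercase is the canonical spelling; the NaN alias was removed in NumPy 2.0)
 - numpy.NINF   (negative infinity; removed in NumPy 2.0, use -numpy.inf)
 - numpy.PZERO  (positive zero; removed in NumPy 2.0, use 0.0)
 - numpy.NZERO  (negative zero; removed in NumPy 2.0, use -0.0)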
 - numpy.pi
 - numpy.euler_gamma
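
The special floating-point constants behave differently from ordinary numbers in comparisons. A minimal sketch, using a made-up `values` array, of how they are usually tested:

```python
import numpy as np

values = np.array([1.0, np.nan, np.inf, -np.inf])

nan_equal = (np.nan == np.nan)      # NaN never compares equal, even to itself
is_nan = np.isnan(values)           # elementwise NaN test
is_inf = np.isinf(values)           # True for both +inf and -inf
is_finite = np.isfinite(values)     # neither NaN nor infinite

print(nan_equal)
print(is_nan, is_inf, is_finite)
```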

In [None]:
np.log(0)

In [None]:
-0.0                                            #negative zero; the np.NZERO alias was removed in NumPy 2.0

In [None]:
np.pi

##### 1.2 Numpy Ndarray creation and reshape

In [None]:
#importing numpy
import numpy as np

In [None]:
#creating array
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1                                        #Note the representation: array([...]) with parentheses around square brackets

In [None]:
print("Dimensions:", arr1.ndim)

In [None]:
# Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

In [None]:
print("Dimensions:", arr2.ndim)
print("Shape:", arr2.shape)

In [None]:
# other functions for creating new arrays.
a = np.zeros(10)                                #array with all zeros
b = np.ones((3, 6))                             #array with all ones
print("a=", a,"\n","b=",b)

In [None]:
np.eye(3)                                       #Identity matrix

In [None]:
np.random.rand(3,3)                             # random values

In [None]:
arr = np.arange(8)
arr

In [None]:
#change the dimension of the array
arr.reshape((4, 2))                               #Note the order of numbers 

In [None]:
x = arr.reshape((4,2), order='F')                  #order='C' fills row by row (C style), order='F' fills column by column (Fortran style)
x

In [None]:
#chaining 
np.arange(10).reshape((2,5))

In [None]:
#convert from higher dimension to single dimension
x.flatten()                                              #Note the order of numbers

##### 1.3 Ndarray attributes

In [None]:
arr = np.arange(9).reshape(3,3, order='F')
arr

In [None]:
arr.max()                                                    #max value; arr.min() gives the minimum value

In [None]:
arr.argmax(axis=1)                                           # argmax - index of maximum value, similarly argmin

In [None]:
arr.dtype                                                    #data type

##### 1.4 Array arithmetic 
 - Arithmetic operations on arrays are applied elementwise
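
Because `*` is elementwise, it is not the matrix product. The sketch below, with two small made-up matrices `a` and `b`, contrasts the two operations.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

elementwise = a * b    # multiplies matching positions
matrix_prod = a @ b    # row-by-column matrix product, same as np.dot(a, b)

print(elementwise)
print(matrix_prod)
```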

In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr + 10.

In [None]:
arr * 2

In [None]:
arr*arr

In [None]:
np.sqrt(arr)

In [None]:
np.square(arr)

In [None]:
arr.sum()                                             #total of all elements

In [None]:
arr.mean()                                            #mean of all elements           

In [None]:
arr.mean(axis=0)                                      #column means                         

##### 1.5 Array Slicing

In [None]:
#slicing
arr = np.arange(10)
arr

In [None]:
arr[0]                                       #index starts at 0

In [None]:
arr[5:8]                                     #slicing within a range

In [None]:
#broadcasting 
arr[5:8] = 12                                #slicing creates a view - hence original arr is modified
arr

In [None]:
# Slicing in higher dimensions
arr = np.arange(9).reshape(3,3)
arr

In [None]:
# slicing : row, column
arr[1,2]

In [None]:
# slicing : arr[rowstart:rowend, columnstart:columnend] - the end indices are excluded

In [None]:
arr[1:3,1:3]                                # will return a 2 x 2 array

In [None]:
#What would be the output?
arr[1,:3]                                  

In [None]:
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr

In [None]:
arr[[4, 3, 6]]                                #select specified rows

In [None]:
#Array copy
arr = np.array([11, 12, 13, 14, 15])

x = arr.copy()                                #creates a copy of original array

arr[0] = 42
print(arr)
print(x)                                      # note that x doesn't change
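
To make the view versus copy distinction explicit, here is a short sketch that uses `np.shares_memory` to check whether two arrays use the same underlying data. The variable names are chosen for this example only.

```python
import numpy as np

arr = np.arange(5)
view = arr[1:4]          # basic slicing returns a view onto arr
copy = arr[1:4].copy()   # an explicit copy owns its own data

view[0] = 99             # also changes arr[1]
copy[1] = -1             # arr is unaffected

shares_view = np.shares_memory(arr, view)
shares_copy = np.shares_memory(arr, copy)

print(arr)
print(shares_view, shares_copy)
```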

##### 1.6 Transpose, Swapping and Dot product

In [None]:
arr = np.arange(16).reshape(4, 4)
arr

In [None]:
arr.T                                                   #transpose 

In [None]:
arr.swapaxes(1,0)                                       #same as transpose - as there are only 2 axes

In [None]:
np.dot(arr.T, arr)                                      #dot product

In [None]:
arr.trace()

In [None]:
np.linalg.eig(arr)                                       #eigen values & eigen vectors

In [None]:
np.linalg.svd(arr)                                       #singular value decomposition

##### 1.7 Numpy Example

In [None]:
sample1 = np.random.normal(100,15,40)
sample2= np.random.normal(125,15,40)

In [None]:
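def t_test(x, y):
    # paired-sample t statistic: mean of the differences divided by its standard error
    diff = y - x
    var = np.var(diff, ddof=1)              # sample variance of the differences
    num = np.mean(diff)
    denom = np.sqrt(var / len(x))           # standard error of the mean difference
    return np.divide(num, denom)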

In [None]:
# Null hypothesis : mean(sample1) = mean(sample2)

t_stats = t_test(sample1, sample2)


In [None]:
from scipy import stats
dof = len(sample1) - 1
p_value = 1 - stats.distributions.t.cdf(t_stats, dof)      # one-sided (upper-tail) p-value

print("The t value is {} and the p value is {}.".format(t_stats, p_value))               
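
Written out, the statistic that `t_test` computes above is the paired-sample t statistic, with $d_i = y_i - x_i$ and $n$ the number of pairs (the symbols are introduced here for illustration):

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2
$$

The p-value in the cell above is the upper-tail probability of a t distribution with $n - 1$ degrees of freedom evaluated at $t$.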

#### 2. Pandas
- Pandas is a Python package providing fast and flexible data structures
- It is designed to work with structured, relational or labeled (tabular) data 
- It has functions for reading, analyzing, cleaning, exploring (plotting), and manipulating data
- It works very well with large amounts of data for indexing, subsetting, slicing, reshaping and merging
- It is also great for working with time series data with functionality for quick filtering and plotting
- The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" 

##### 2.1 Pandas data structures
- Pandas has a few important data structures:
    - Series: 1 dimensional array holding data of any type (like a column in a table)
    - DataFrame: 2 dimensional data structure (a table with rows and columns)
    - Datetime : also an important data structure for working with dates
<br><br>
- Other data structures like pandas arrays, strings etc. are not in common use.
- Panel, a 3 dimensional data structure for storing panel data, was deprecated and has since been removed from pandas.

###### 2.1.1 Pandas Series

In [None]:
#import pandas 
import pandas as pd
import numpy as np

In [None]:
#creating a series from list
a = [1, 7, 2, 5,9]
myseries = pd.Series(a)

print(myseries)                                            #notice the index printed with the values, along with data type

In [None]:
#specifying the index 
myserwithind = pd.Series(a, index = ["a", "b", "c", "d", "e"])
print(myserwithind) 

In [None]:
#creating a series from dictionary
calories = {"day1": 420, "day2": 380, "day3": 390}
mydiet = pd.Series(calories)

print(mydiet)                                                 #where do the keys go? 

In [None]:
#Naming the series attributes 

fruitprice = {'apples': 200, 'kiwi': 300, 'oranges': 70, 'cherries': 500, 'banana':30, 'guava':55}
mySeries = pd.Series(fruitprice)

mySeries.name = 'March Fruit Prices'                          #name is the sweetest sound for every individual
mySeries.index.name = 'Fruit'

print(mySeries)

In [None]:
mySeries.ndim

In [None]:
mySeries.size

In [None]:
#which is the most expensive fruit?
mySeries.idxmax()                                       # returns the index label; argmax() returns the integer position

In [None]:
#which are the two cheapest fruits
mySeries.nsmallest(n=2, keep='last')                    # similarly nlargest()

In [None]:
#Average price of all fruits
mySeries.mean()

In [None]:
#Sort the series by value
mySeries.sort_values()                                  

In [None]:
#Sort the series by index
mySeries.sort_index()                                   # alphabetical order

In [None]:
#Selecting an element in a pandas series
print(mySeries[1])                                  # integer treated as a position; deprecated in recent pandas, prefer .iloc[1]
print(mySeries['kiwi'])
print(mySeries.kiwi)
print(mySeries.loc['kiwi'])
print(mySeries.iloc[1])

In [None]:
#price of kiwi increases 
mySeries['kiwi'] = 350                              
print(mySeries)

In [None]:
#searching using index 
'apples' in mySeries

In [None]:
#Adding two series 
basket1 = pd.Series({'apples': 5, 'kiwi': 10, 'oranges': 7, 'cherries': 50})
basket2 = pd.Series({'apples': 6, 'pineapple': 2, 'oranges': 6, 'banana': 12})

total = basket1 + basket2
print(total)                                        # What would be the output? 
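
When two Series are added, labels are aligned first, and a label present in only one of them yields NaN. The sketch below repeats the two baskets and shows `Series.add` with `fill_value=0` as one possible way to treat a missing fruit as zero.

```python
import pandas as pd

basket1 = pd.Series({'apples': 5, 'kiwi': 10, 'oranges': 7, 'cherries': 50})
basket2 = pd.Series({'apples': 6, 'pineapple': 2, 'oranges': 6, 'banana': 12})

# Labels present in only one basket become NaN with the + operator
plain_sum = basket1 + basket2
print(plain_sum)

# add() with fill_value treats a missing label as 0 instead
filled_sum = basket1.add(basket2, fill_value=0)
print(filled_sum)
```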

In [None]:
#Appending two series (Series.append() was removed in pandas 2.0; pd.concat() replaces it)
pd.concat([basket1, basket2])                       # Now? 

In [None]:
basket1                                               # Does this change basket1? 

In [None]:
basket1.plot()                                        # plotting a series

###### 2.1.2 DataFrame

In [None]:
#creating a dataframe from dictionary
df1 = pd.DataFrame(
    {
        "A": [1.0,2.3,3.4,4.3],
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": pd.Categorical(["spam","ham"]*2)
    }
)

In [None]:
df1

In [None]:
df1.index

In [None]:
df1.shape                                   #(rows , columns)

In [None]:
#quick statistic summary of the data (numerical)
df1.describe()

In [None]:
#quick look - random sample
df1.sample(3)                                      #head(n) and tail(n) also work as expected

###### 2.1.2.1 Row and Column Selection

In [None]:
#selection of rows
df1[0:2]                                  #list style indexing

In [None]:
#selection of rows
df1.loc[0:2,]                              #.loc is label-based (recommended); the end label 2 is included

In [None]:
#selection of columns
df1.loc[:, ["A", "B"]]                

In [None]:
#selection using numerical position
df1.iloc[0:3, 1:4]
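
One detail worth checking by hand is how `.loc` and `.iloc` treat the end of a slice. A minimal sketch on a hypothetical one-column frame `demo`:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3
demo = pd.DataFrame({"A": [10, 20, 30, 40]})

by_label = demo.loc[0:2]       # label-based: the end label 2 is included
by_position = demo.iloc[0:2]   # position-based: the end position 2 is excluded

print(len(by_label), len(by_position))
```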

In [None]:
# Conditional selection using a single column
df1[df1.A>3.0]

In [None]:
#query function 
df1.query('E == "test"')                             # equivalent to df1[df1['E'] == "test"]       

In [None]:
#adding a new column
df1["G"] = ["one", "two", "three", "four"]

In [None]:
df1[df1.G.isin(["two", "four"])]                   #.isin() for selection 

In [None]:
#dropping a column
df1.drop('G', axis=1 , inplace=True)               #inplace=True will make changes to existing dataframe              
df1

In [None]:
#setting values 
import numpy as np
df1.loc[:, "D"] = np.arange(10,14)

In [None]:
df1

##### 2.1.2.2 Grouping Data 
- groupby() takes several parameters and returns a DataFrameGroupBy object that contains information about the groups.
- groupby() can be combined with sum(), count(), transform(), aggregate() and many other methods.
- Syntax of DataFrame.groupby() (pandas 1.x signature; squeeze was removed in pandas 2.0)<br>
    DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
    sort=True, group_keys=True, squeeze=<no_default>,      <br>
    observed=False, dropna=True)                           
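
As a sketch of combining groupby() with aggregate(), the example below computes two statistics per group in one call using named aggregation. The `sales` frame and the output column names `total` and `average` are chosen for this illustration.

```python
import pandas as pd

sales = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
                      'Sales': [200, 120, 340, 124, 243, 350]})

# Named aggregation: each keyword becomes an output column
summary = sales.groupby('Company').agg(total=('Sales', 'sum'),
                                       average=('Sales', 'mean'))
print(summary)
```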

In [None]:
#grouping data
df = pd.DataFrame({'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
                   'Month':['Jan','Feb','Jan','Feb','Jan','Feb'],
                   'Sales':[200,120,340,124,243,350]})
df

In [None]:
#average sales per company
df.groupby("Company")["Sales"].mean()      #select the numeric column; mean() fails on the text column 'Month' in pandas 2.x

In [None]:
#Total sales per month
df.groupby("Month")["Sales"].sum()         #select the numeric column so company names are not concatenated

In [None]:
#group by a numerical value(bins) calculate a statistic
s_groups = pd.cut(df['Sales'], bins=[100, 200, 300, np.inf])
df.groupby(s_groups).count()

In [None]:
#Cross tabulation 
pd.crosstab(df.Company, df.Month)

##### 2.1.2.3 Dealing with missing data 

In [None]:
d = {'A':[1,2,np.nan,3],'B':[5,np.nan, np.nan,7],'C':[1,2,3,4]}
df = pd.DataFrame(d)
df

In [None]:
#drop rows with NaN value
df.dropna() 

In [None]:
# drop columns with null values, given a threshold (at least n non-NaN values for the column to survive)
df.dropna(axis=1, thresh=3)

In [None]:
#fill NaN with a user specified value
df.fillna(value=99)                                #numeric fill value; to change the df use inplace=True

In [None]:
#fill NaN with different value for different column

colfill = {"A": -99, "B": 99, "C": 999}
df.fillna(value=colfill)

In [None]:
#fill NaN with forward fill method - last valid number
df.ffill()                                        #same as the older df.fillna(method='ffill')

In [None]:
#fill NaN with backward fill method - next valid number
df.bfill()                                        #same as the older df.fillna(method='backfill')

In [None]:
df['A'].mean()

In [None]:
# mean imputation - note that this fills the NaNs in every column with the mean of column A
df.fillna(value= df['A'].mean()) 
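
Before choosing a fill strategy it often helps to count the missing values per column. The sketch below does that on the same small frame and also shows a per-column mean imputation, as one possible alternative to a single fill value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 3],
                   'B': [5, np.nan, np.nan, 7],
                   'C': [1, 2, 3, 4]})

missing_per_column = df.isna().sum()     # count of NaN values in each column
filled = df.fillna(df.mean())            # each column is filled with its own mean

print(missing_per_column)
print(filled)
```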

##### 2.1.2.4 Merging  & Joining Datasets 

In [None]:
d1 = {"Id": ['I01', 'I02', 'I03', 'I04','I05'],
     "Name":['Aamir', 'Salman', 'Shahrukh', 'Akshay', 'Hrithik'], 
      "Age":[45, 54, 55, 56, 44],} 

d2 = {"Id": ['I02', 'I01', 'I04', 'I03'],
 "Address":["Delhi", "Gurgaon", "Noida", "Pune"], 
 "Qualification":["Btech", "B.A", "Bcom", "B.hons"]}

df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)

In [None]:
# concat() with axis=0 stacks DataFrames on top of each other (rows)
pd.concat([df1,df2], axis=0, sort=False)                     #pass ignore_index=True to renumber the rows

In [None]:
# concat() with axis=1 places DataFrames side by side (columns), aligned on the index
pd.concat([df1,df2], axis=1, sort=False)

In [None]:
pd.concat([df1, df2], sort=False)            #replaces df1.append(df2), which was removed in pandas 2.0

In [None]:
# merge() is used for combining data on common columns or indices.
df1.merge(df2)                                    #Inner Join - automatically on "Id" - common rows

In [None]:
pd.merge(df1,df2,left_on="Id",right_on="Id",how='inner')        #left_on/right_on are needed when the key columns have different names

In [None]:
pd.merge(df1,df2,how='left')                                   #Left Join

In [None]:
pd.merge(df1,df2,how='outer')                                   #Outer Join

In [None]:
# join() is used for combining data on index.
df1.join(df2, lsuffix='_l', rsuffix='_r')                  #Same column get renamed
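
To see which rows each join type keeps, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two small made-up frames, `left` and `right`, rather than the full tables above.

```python
import pandas as pd

left = pd.DataFrame({"Id": ["I01", "I02", "I05"],
                     "Name": ["Aamir", "Salman", "Hrithik"]})
right = pd.DataFrame({"Id": ["I02", "I01", "I04"],
                      "Address": ["Delhi", "Gurgaon", "Noida"]})

# Outer join keeps every Id; _merge records where each row came from
merged = pd.merge(left, right, on="Id", how="outer", indicator=True)
origin = dict(zip(merged["Id"], merged["_merge"].astype(str)))

print(merged[["Id", "_merge"]])
```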

##### 2.1.2.5 Operations on Datasets

In [None]:
df1.columns

In [None]:
df1.index

In [None]:
#count of each column
df1.count()

In [None]:
# Stats for numeric cols
df1.describe()

In [None]:
# unique values
df2["Address"].unique()

In [None]:
df1['Age'].value_counts()

In [None]:
# apply a function to each value in a column
df1["Age"].apply(lambda x: x/2) 

In [None]:
df1.sort_values('Age')

##### 2.1.2.6 Data Input & Output

In [None]:
#Reading csv file into a DataFrame
netflix_df = pd.read_csv("netflix_subscription_fee.csv")
netflix_df.head()

In [None]:
#reading an excel file into a DataFrame
insurance_df = pd.read_excel("insurance_data.xlsx",sheet_name=0, parse_dates=True)
insurance_df.count()

In [None]:
insurance_df.InsuredValue.max()

In [None]:
import html5lib
#Read HTML tables into a list of DataFrame objects.
data = pd.read_html('https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/')

In [None]:
data[0]

In [None]:
#reading a json file 
iris_df = pd.read_json("https://raw.githubusercontent.com/domoritz/maps/master/data/iris.json")

In [None]:
iris_df.head()

In [None]:
# writing to a csv file
iris_df.to_csv("iris.csv")
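
By default `to_csv` also writes the row index as the first column. The sketch below, which uses a made-up frame and an in-memory buffer instead of a file on disk, shows `index=False` and reads the result back.

```python
import io
import pandas as pd

# Hypothetical data used only for this round trip
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

buffer = io.StringIO()           # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)   # index=False leaves out the row index column
buffer.seek(0)

round_trip = pd.read_csv(buffer)
print(round_trip)
```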

##### 2.1.2.7 Basic Plots 
- Pandas uses the plot() method to create diagrams.
- Pyplot, a submodule of the Matplotlib library can be used to visualize the diagram on the screen.

In [None]:
# read motor trends cars data from the web   
data = pd.read_html("https://gist.github.com/seankross/a412dfbd88b3db70b74b")
mtcars_df = data[0]
mtcars_df.head()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# scatter plot
mtcars_df.plot(kind = 'scatter', x = 'disp', y = 'hp')
plt.show()

In [None]:
#histogram
mtcars_df["mpg"].plot(kind = 'hist')

#### 3. String & Text

In [None]:
#a string is a sequence of characters.
message = "Have a wonderful day!"
print(message[0:5])

In [None]:
#number of characters in a string
len(message)

In [None]:
message.upper()

In [None]:
# index of a substring
message.find("day")

In [None]:
#how many 'a' are there?
message.count('a')

In [None]:
# break into 3 parts 
message.partition(' ')

In [None]:
# break into separate words
message.split(" ")

In [None]:
#String formatting 
quantity = 3
itemno = 567
price = 49.95
myorder = "I want {} pieces of item {} for {} dollars."

In [None]:
# c-style printing
print("I want %d pieces of item %d for %.2f dollars." %(quantity, itemno, price))

In [None]:
# string formatting function
print(myorder.format(quantity, itemno, price))

In [None]:
# f-string
print(f'I want {quantity} pieces of item {itemno} for {price} dollars.')

In [None]:
# f-strings are fast, and the expressions inside the braces are evaluated at runtime
numbers = [1, 2, 3, 4, 5]                      # avoid naming a variable "list": it hides the built-in type
print(f'The sum is: {sum(numbers)}')
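
f-strings also accept a format specification after a colon, which covers what `%.2f` does in the C-style example. A short sketch with illustrative values:

```python
price = 49.95
quantity = 3

total_text = f"Total: {price * quantity:.2f} dollars"   # two decimal places
item_text = f"Item no: {567:>8}"                        # right-aligned in 8 characters
share_text = f"Share: {0.256:.1%}"                      # percentage with one decimal

print(total_text)
print(item_text)
print(share_text)
```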

##### 3.2 Regular expressions
- Python has a built-in package called re, which can be used to work with regular expressions
- Functions:
     - findall: Returns a list containing all matches
     - search : Returns a Match object if there is a match anywhere in the string
     - split: Returns a list where the string has been split at each match
     - sub: Replaces one or many matches with a string
     
- Character description
    - `[]` : a set of characters
    - `\` : signals a special sequence
    - `.` : any character
    - `^` : starts with
    - `$` : ends with
    - `*` : zero or more occurrences
    - `+` : one or more occurrences
    - `?` : zero or one occurrence
    - `{}` : exactly the specified number of occurrences
    - `|` : either or
    - `()` : capture and group
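
A few of the special characters listed above in action. The `log` string is made-up text for this sketch.

```python
import re

log = "order 12 shipped 2023-04-05, order 345 shipped 2023-11-30"   # made-up text

dates = re.findall(r"\d{4}-\d{2}-\d{2}", log)      # {n}: exactly n digits
orders = re.findall(r"order (\d+)", log)           # +: one or more digits, (): capture group
starts = bool(re.search(r"^order", log))           # ^: match only at the start
colours = re.findall(r"colou?r", "color colour")   # ?: zero or one 'u'

print(dates, orders, starts, colours)
```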

In [None]:
import re

In [None]:
text = "It rained in Spain"
x = re.findall("ai", text)
print(x)

In [None]:
x = re.split(r"\s", text)       # \s -> white space character; the raw string avoids an invalid-escape warning
print(x)

In [None]:
# replacing parts of strings
addresses = 'aaa@gmail.com bbb@hotmail.com ccc@apple.com'    # avoid naming a variable "str": it hides the built-in type
print(re.sub(r'[a-z]*@', 'info@', addresses))      #replace the letters before each @ with "info"

In [None]:
addresses = 'aaa@gmail.com bbb@hotmail.com ccc@apple.com'
print(re.sub(r'gmail|hotmail|apple', 'accelerateai', addresses))

In [None]:
# searching for specific pattern - re.search() 
# it takes a regular expression pattern and a string and searches for that pattern within the string. 
# If the search is successful, search() returns a match object

sentence = 'purple alice-b@google.com monkey dishwasher'

match = re.search(r'([\w.-]+)@([\w.-]+)', sentence)
if match:
    print(match.group())   ## (the whole match)
    print(match.group(1))  ## (the username, group 1)
    print(match.group(2))  ## (the host, group 2)

In [None]:
## re.findall() returns a list of all the found email strings
sentence = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', sentence) 
for email in emails:
    print(email)