<div style="text-align:center">
    <img src="../files/monolearn-logo.png" height="150px">
    <h1>ML course</h1>
    <h3>Session 02: NumPy, Pandas, Matplotlib, Seaborn, Chart types</h3>
    <h4><a href="https://amzenterprise.ir/">Ali Momenzadeh</a></h4>
</div>

### NumPy

In [None]:
# pip install numpy

#### What is NumPy?

NumPy, short for "Numerical Python", is the core Python library for scientific computing. It provides a powerful n-dimensional array object and tools for integrating code written in C, C++, etc.

It also offers linear algebra routines and random number generation. A NumPy array can also be used as an efficient multi-dimensional container for generic data. The next section explains what exactly a NumPy array is.

<img src="../files/2/330px-NumPy_logo_2020.svg.png">

#### NumPy Array
A NumPy array is a powerful N-dimensional array object; in the two-dimensional case it is laid out in rows and columns. You can initialize NumPy arrays from nested Python lists and access their elements by index. Before you can perform any of these operations, the next question that comes to mind is:

<img src="../files/2/1t7STdM.gif">

#### Import NumPy

In [None]:
import numpy as np

In [None]:
np.__version__

In [None]:
print(np.__version__)

#### Python list vs NumPy array

NumPy gives you an enormous range of fast and efficient ways of creating arrays and manipulating numerical data inside them. While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be homogeneous. The mathematical operations that are meant to be performed on arrays would be extremely inefficient if the arrays weren’t homogeneous.
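
To make the difference concrete, here is a minimal sketch (the variable names are only for illustration). It shows that a list keeps each element's own type, while an array converts everything to one common dtype:

```python
import numpy as np

# A Python list can mix types freely.
mixed_list = [1, 2.5, "three"]
print([type(item).__name__ for item in mixed_list])  # ['int', 'float', 'str']

# A NumPy array stores a single dtype, so mixed numeric input is upcast.
int_and_float = np.array([1, 2.5, 3])
print(int_and_float.dtype)  # float64

# Homogeneous storage is what makes vectorized arithmetic possible.
doubled = int_and_float * 2
print(doubled)  # [2. 5. 6.]
```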

#### Why use NumPy?

NumPy arrays are faster and more compact than Python lists, and they are convenient to use. NumPy uses much less memory to store data, and it provides a mechanism for specifying the data type of the elements, which allows the code to be optimized even further.
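
The memory claim can be checked with a rough estimate. The sketch below assumes CPython and uses `sys.getsizeof`, which only approximates the real footprint of a list; the names `py_list`, `arr` and `small` are illustrative:

```python
import sys
import numpy as np

n = 1_000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Size of the list object plus the int objects it points to (an approximation).
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# Size of the array's data buffer: n elements * 8 bytes each.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)

# Choosing a smaller dtype shrinks the buffer when the value range allows it.
small = arr.astype(np.int16)
print(small.nbytes)
```

The exact list size depends on the Python version, but the array buffer is always `n * itemsize` bytes.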

#### Create arrays in NumPy

<img src = "../files/2/np_create_matrix.png">

In [None]:
a = np.array([[1, 2, 3], 
              [4, 5, 6], 
              [7, 8, 9]])
print(a)

In [None]:
print(type(a))

#### Select items from a NumPy array

<img src = "../files/2/numpy_indexing.png" style="width: 50%">

a[2:10:2, 1:9:3]
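
The slice above follows the `start:stop:step` pattern on each axis. As a small sketch, the hypothetical 10x10 array `grid` below makes the selected rows and columns easy to read. Applying the same slice to the 3x3 array `a` simply clips the out-of-range stops:

```python
import numpy as np

# Hypothetical 10x10 grid where each value encodes its position: row * 10 + col.
grid = np.arange(100).reshape(10, 10)

# Rows 2, 4, 6, 8 (start 2, stop 10, step 2); columns 1, 4, 7 (start 1, stop 9, step 3).
subset = grid[2:10:2, 1:9:3]
print(subset)
print(subset.shape)  # (4, 3)

# On a 3x3 array the stops are clipped, so only row 2 and column 1 remain.
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(a[2:10:2, 1:9:3])  # [[8]]
```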

In [None]:
a

In [None]:
print(np.where(a > 5, a, 0))

<img src = "../files/2/np_indexing.png" style="width: 100%">

In [None]:
a

In [None]:
a[0]

In [None]:
a[2,1]

In [None]:
# a[2, 3]  # raises IndexError: column index 3 is out of bounds for a 3x3 array

#### Shape and Reshape in NumPy

In [None]:
print(a.shape)

In [None]:
type(a.shape)

In [None]:
print(a.shape[1])

In [None]:
# 2D array (or matrix)
b = np.array([[1, 2, 3], [4, 5, 6]])

In [None]:
b.shape

In [None]:
b

In [None]:
print(np.reshape(b, (3, 2)))

<img src="../files/2/numpy_array_t.png" width="50%"/>

#### Reshaping and flattening multidimensional arrays

In [None]:
# -1 means the number of columns will be determined automatically
print(np.reshape(b, (1, -1))) 

In [None]:
# -1 means the number of rows will be determined automatically
print(np.reshape(b, (-1, 1)))

There are two popular ways to flatten an array: `.flatten()` and `.ravel()`. The primary difference between the two is that `flatten()` always returns a copy, while `ravel()` returns a reference to the parent array (i.e., a “view”) whenever possible. Any change to a view affects the parent array as well. Because `ravel()` usually does not create a copy, it is memory efficient.
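
A short sketch of this difference, using `np.shares_memory` to check whether two arrays use the same buffer (the names `flat_copy`, `flat_view` and `transposed_flat` are illustrative):

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = b.flatten()  # always a new array
flat_view = b.ravel()    # a view here, because b is contiguous in memory

flat_copy[0] = 100       # does not touch b
flat_view[1] = 200       # writes through to b

print(b)
print(np.shares_memory(b, flat_copy))  # False
print(np.shares_memory(b, flat_view))  # True

# ravel() falls back to a copy when a view is impossible, e.g. after a transpose.
transposed_flat = b.T.ravel()
print(np.shares_memory(b, transposed_flat))  # False
```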

In [None]:
b

In [None]:
np.ravel(b)

<img src="../files/2/ravelvsflatten.jpg">

In [None]:
print(b)

In [None]:
b.flatten()

In [None]:
b.ravel()

#### Other array properties in NumPy

In [None]:
print(a.ndim)

In [None]:
print(a.dtype)

In [None]:
c = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

In [None]:
print(c.dtype)

In [None]:
c

In [None]:
# number of elements
print(a.size)

In [None]:
a.itemsize

In [None]:
# size of each element (in bytes)
print(c.itemsize)

#### arange in NumPy

In [None]:
d1 = np.arange(1, 20, step=3)
print(d1)

In [None]:
np.arange

#### linspace in NumPy

In [None]:
d2 = np.linspace(1, 2, num=5)
print(d2)

In [None]:
d3 = np.linspace(1, 2, num=11)
print(d3)

#### Create specific arrays in NumPy

- `np.ones`
- `np.zeros`
- `np.full`
- `np.eye`

<img src="../files/2/np_array_dataones.png" /> 

In [None]:
print(np.ones(shape=(3, 2)))

In [None]:
print(np.zeros(shape=(2, 3)))

In [None]:
print(np.zeros((2, 3)))

In [None]:
print(np.zeros(shape=(2, 3), dtype=np.int32))

In [None]:
print(5. * np.ones(shape=(3, 2)))

In [None]:
print(np.full((3, 2), 5))

In [None]:
print(np.eye(4))

In [None]:
print(np.fliplr(np.eye(4)))

In [None]:
print(np.random.rand(3, 2))

#### Operations on NumPy arrays

<img src="../files/2/np_sub_mult_divide.png" /> 

In [None]:
x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)

print(x)
print()
print(y)

In [None]:
# Elementwise sum; both produce the array
print(x + y)
print()
print(np.add(x, y))

In [None]:
# Elementwise difference; both produce the array
print(x - y)
print()
print(np.subtract(x, y))

In [None]:
# Elementwise product; both produce the array
print(x * y)
print()
print(np.multiply(x, y))

In [None]:
# Elementwise division; both produce the array
print(x / y)
print()
print(np.divide(x, y))

In [None]:
# Elementwise square root; produces the array
print(np.sqrt(x))

#### Adding, removing, and sorting elements in NumPy arrays

In [None]:
arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])

In [None]:
np.sort(arr)

In [None]:
a = np.array([10, 20, 30, 40])
b = np.array([50, 60, 70, 80])

In [None]:
np.concatenate((a, b))

#### More useful array operations in NumPy

<img src="../files/2/np_aggregation.png" /> 

<img src="../files/2/np_matrix_aggregation_row.png" /> 

In [None]:
a

In [None]:
type(a)

In [None]:
a.sum()

In [None]:
a.min()

In [None]:
a.max()

### Pandas

In [None]:
# pip install pandas

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("titanic.csv")

In [None]:
type(df)

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df['survived'][0:5]

In [None]:
df.survived[0:5]

In [None]:
df[['survived','age','fare']][10:25]

In [None]:
df['survived'].value_counts()

In [None]:
df['survived'].value_counts(normalize=True)* 100

339 female and 161 male passengers survived.

In [None]:
pd.crosstab(df.sex, df.survived)

#### Normalize data

In [None]:
# Normalize data by index (rows)
pd.crosstab(df.sex, df.survived, normalize = "index")

In [None]:
# Normalize data by columns
pd.crosstab(df.sex, df.survived, normalize = "columns")

In [None]:
pd.crosstab(df.sex, df.survived, normalize = "all")

#### Let's investigate some questions (queries) on data

1. List all children aged 5 or younger

In [None]:
below_5_years = df[df.age <= 5]

In [None]:
below_5_years

In [None]:
below_5_years[0:3]

2. Number of children aged 5 or younger

In [None]:
len(below_5_years)

3. Number of children who survived

In [None]:
df[df.age <= 5][["survived","age","pclass"]]

In [None]:
df[df.age <= 5]["survived"].value_counts()

4. Proportion of children who survived

In [None]:
df[df.age <= 5]["survived"].value_counts(normalize = True)

In [None]:
df[df.age <= 5]["survived"].value_counts(normalize = True) * 100

5. Get Allen's information

In [None]:
df[df.name.str.contains( "Allen")]

In [None]:
df[df.name.str.contains( "Ali")]

6. Get the unique values in a column

In [None]:
df.embarked.unique()

7. Print age, sex and pclass of surviving children (aged 5 or younger)

In [None]:
df[(df.survived == 1) & (df.age <= 5)][['age','sex','pclass']][10:20]

8. Print age, survival status, sex and pclass of passengers with unknown age

In [None]:
df[df.age.isnull()][['age', 'survived', 'sex', 'pclass']][0:5]

9. Print age, survival status, sex and pclass of passengers with known age

In [None]:
df[df.age.notnull()][['age', 'survived', 'sex', 'pclass']][0:5]

#### mean and groupby

In [None]:
df.groupby('pclass')['fare'].mean()

In [None]:
pclass_age_mean_df = df.groupby('pclass')['age'].mean().reset_index()
pclass_age_mean_df

In [None]:
pclass_gender_fare_mean_df = df.groupby(['pclass', 'sex'])['fare'].mean().reset_index()
pclass_gender_fare_mean_df

#### Sort a dataframe

In [None]:
df.sort_values( 'age', ascending=False)

In [None]:
df.sort_values('survived', ascending = False)

#### Subset a dataframe in Pandas

##### Select a single column

In [None]:
single_column = df['age']
single_column

In [None]:
type(single_column)

##### Select multiple columns

In [None]:
multiple_column = df[['age','pclass','fare']]
multiple_column

##### Selection and indexing

<img src="../files/2/Pandas-selections-and-indexing-1024x731.png" width=100%>

##### Notes:


* When selecting subsets of data, square brackets [] are used.

* Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.

* Use `loc` to select specific rows and/or columns by their row and column names (labels).

* Use `iloc` to select specific rows and/or columns by their positions in the table.
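
The notes above can be sketched on a tiny, made-up DataFrame. The frame `people` and its values are hypothetical. Its index labels are deliberately not 0, 1, 2, so labels and positions cannot be confused:

```python
import pandas as pd

# Hypothetical toy data; the index labels are deliberately not 0, 1, 2.
people = pd.DataFrame(
    {"age": [29, 2, 30], "fare": [211.3, 151.6, 151.6]},
    index=[10, 20, 30],
)

by_label = people.loc[20, "age"]   # row labelled 20, column "age"
by_position = people.iloc[1, 0]    # second row, first column

print(by_label, by_position)       # the same cell, reached two ways

# Label slices include the end point; position slices exclude it.
print(people.loc[10:20])           # rows labelled 10 and 20
print(people.iloc[0:1])            # only the first row
```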

##### .iloc

##### The `.iloc` attribute is the primary access method for selection by integer position. The following are valid inputs:

- An integer e.g. 5.

- A list or array of integers [4, 3, 0].

- A slice object with ints 1:7.

- A boolean array.

- A callable that takes the DataFrame and returns one of the inputs above.
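
Each input type from the list above can be tried on a small, hypothetical DataFrame (the name `frame` and its contents are made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame({"x": [10, 11, 12, 13, 14], "y": list("abcde")})

single = frame.iloc[2]                                 # integer: one row as a Series
picked = frame.iloc[[4, 3, 0]]                         # list of integers, in that order
sliced = frame.iloc[1:4]                               # slice, end position excluded
masked = frame.iloc[[True, False, True, False, True]]  # boolean list, one flag per row
called = frame.iloc[lambda d: [0, len(d) - 1]]         # callable returning positions

print(single)
print(picked)
print(sliced)
print(masked)
print(called)
```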

<img src="../files/2/pandas_select_row_by_index_iloc.png">

##### Select a single row

In [None]:
df

In [None]:
df.iloc[3]

##### Select multiple rows

In [None]:
df.iloc[[3, 5, 7]]

##### Select multiple rows and columns

In [None]:
df.iloc[[3, 4], [1, 2, 6]]

In [None]:
df.iloc[9:25, 2:5]

##### Select all rows and some columns
<img src="../files/2/iloc_select_all_rows_and_first_column_in_pandas.png">

In [None]:
df.iloc[:, [1, 2]]

In [None]:
df

In [None]:
col= [False, False, True, True, True,False, False, True, True, True,False, False, True, True]
df.iloc[1:10, col]

##### .loc

##### Select a single row

In [None]:
df.loc[1]

##### Select all rows and some columns

In [None]:
df.loc[:,['pclass','embarked','fare']]

##### Query()

In [None]:
df.query('survived == 1 and embarked == "S" and pclass == 1 and fare > 200')

In [None]:
df.loc[(df.survived==1)&(df.embarked=='S')&(df.fare>200)&(df.pclass==1)]

In [None]:
df.embarked.unique()

In [None]:
df[(df['pclass']==1) & (df['embarked']=='S') & (df['fare']>200) & (df['survived']==1)]

### Matplotlib and Seaborn (Charts)

<img src="../files/2/logo2_compressed.svg" width=50%>

The primary plotting library for Python is called **Matplotlib**.

**Seaborn** is a plotting library that offers a simpler interface, sensible defaults for plots needed for machine learning, and most importantly, the plots are aesthetically better looking than those in Matplotlib.

- Seaborn requires that Matplotlib is installed first.

- You can install both libraries using pip, as follows:

In [None]:
# pip install matplotlib
# pip install seaborn

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
# With the 'inline' backend, Matplotlib graphs are rendered in the notebook, directly below the code cell.

import seaborn as sns

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)

In [None]:
def sinplot():
    x = np.linspace(0, 14, 100)
    for i in range(1, 7):
        plt.plot(x, np.sin(x + i * .5) * (7 - i))

sinplot()

In [None]:
import seaborn as sns
print('seaborn: %s' % sns.__version__)

In [None]:
tips = sns.load_dataset('tips')

In [None]:
type(tips)

In [None]:
tips.head()

In [None]:
tips.info()

#### Visualize the distribution of data

In [None]:
sns.displot(tips, x="total_bill")

In [None]:
sns.displot(tips, x="total_bill", bins=6)

In [None]:
sns.displot(tips, x="size")

In [None]:
sns.displot(tips, x="size", discrete=True)

In [None]:
sns.displot(tips, x="size", shrink=0.8)

In [None]:
sns.displot(tips, x="total_bill", kind="kde")

In [None]:
sns.displot(tips, x="tip", hue='sex', kind="kde")

In [None]:
sns.displot(tips, x="total_bill", hue='smoker', kind="kde")

In [None]:
sns.displot(tips, x="total_bill", hue='smoker', kind="kde", fill=True)

#### Set theme and colors

In [None]:
sns.set_theme(style="ticks", color_codes=True)

In [None]:
sns.color_palette()

In [None]:
sns.color_palette("Paired")

In [None]:
sns.color_palette("rocket")

In [None]:
sns.color_palette("mako")

#### Visualize the distribution of data when we have categorical data

If one of the main variables is “categorical” (divided into discrete groups) it may be helpful to use a more specialized approach to visualization.

##### Scatter plots

In [None]:
sns.catplot(x="day", y="total_bill", data=tips)

In [None]:
sns.catplot(x="day", y="total_bill", jitter=False, data=tips)

In [None]:
sns.catplot(x="day", y="total_bill", kind="swarm", data=tips)

In [None]:
sns.catplot(x="day", y="total_bill", hue="sex", kind="swarm", data=tips)

In [None]:
sns.catplot(x="size", y="total_bill", data=tips)

In [None]:
sns.catplot(x="smoker", y="tip", data=tips)

In [None]:
sns.catplot(x="total_bill", y="day", hue="time", kind="swarm", data=tips)

##### Box plots

<img src="../files/2/boxplot-components-e1541330442828.png.webp" /> 

In [None]:
sns.catplot(x="day", y="total_bill", kind="box", data=tips)

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker", kind="box", data=tips)

In [None]:
sns.catplot(x="size", y="total_bill", hue="smoker", kind="boxen", data=tips)

##### Bar plots

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker", kind="bar", data=tips)

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker", kind="point", data=tips)

In [None]:
sns.catplot(x="time", kind="count", data=tips)

In [None]:
sns.catplot(x="time",hue='sex', kind="count",palette="ch:.25", data=tips)

In [None]:
# catplot returns a FacetGrid (not a single Axes), so the result is named g
g = sns.catplot(x="time", kind="count", data=tips)
g.set(xlabel='X', ylabel='Y')
plt.show()

#### Visualize statistical relationships

In [None]:
sns.relplot(x="total_bill", y="tip", data=tips)

In [None]:
sns.set_theme(style="darkgrid")
sns.relplot(x="total_bill", y="tip", data=tips)

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips)

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", style="smoker",data=tips)

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", style="time", data=tips)

In [None]:
sns.relplot(x="total_bill", y="tip", hue="size", data=tips)

In [None]:
sns.relplot(x="total_bill", y="tip", hue="size",kind='line', data=tips)

In [None]:
sns.relplot(x="total_bill", y="tip", size="size", data=tips)

##### Line plots

In [None]:
df = pd.DataFrame(dict(time=np.arange(500),value=np.random.randn(500).cumsum()))

In [None]:
df

In [None]:
g = sns.relplot(x="time", y="value", kind="line", data=df)

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker",col="time", data=tips)

#### Visualize regression models

In [None]:
sns.regplot(x="total_bill", y="tip", data=tips)

In [None]:
sns.lmplot(x="total_bill", y="tip", data=tips)

#### Visualize multiple relationships with FacetGrid

In [None]:
sns.FacetGrid(tips)

In [None]:
sns.FacetGrid(tips, col="time", row="sex")

In [None]:
g = sns.FacetGrid(tips, col="time",  row="sex")
g.map(sns.scatterplot, "total_bill", "tip")

In [None]:
g = sns.FacetGrid(tips, col="time",  row="sex")
g.map_dataframe(sns.histplot, x="total_bill")

In [None]:
g = sns.FacetGrid(tips, col="time",  row="sex")
g.map_dataframe(sns.histplot, x="total_bill")
g.set_axis_labels("Total bill", "Count")

In [None]:
g = sns.FacetGrid(tips, col="time", row="sex")
g.map_dataframe(sns.histplot, x="total_bill", binwidth=2)
g.set_axis_labels("Total bill", "Count")

In [None]:
g = sns.FacetGrid(tips, col="time", hue="sex")
g.map_dataframe(sns.scatterplot, x="total_bill", y="tip")
g.set_axis_labels("Total bill", "Tip")
g.add_legend()

In [None]:
g = sns.FacetGrid(tips, col="day", height=3.5, aspect=.65)
g.map(sns.histplot, "total_bill")