<center>
<img src="https://i.redd.it/nva5b9dq8r631.png" alt="Python-MEME" width="500" height="600">
</center>

In [None]:
# If the picture does not work for you (there is no Internet connection),
# you can uncomment these two lines below and then run the cell. The picture should appear.
# These two:

#from IPython.display import Image
#Image("./images/nva5b9dq8r631.png")

## Part I: Jupyter system recap

**The most important feature** of Jupyter Notebook / JupyterLab for this course:
* if you're typing something, press `Tab` to see automatic suggestions, use arrow keys + enter to pick one.
* if you move your cursor inside some function and press `Shift + Tab`, you'll get a help window.

In [None]:
import math

# It is good practice to declare all required libraries (modules) at the beginning of your code.
# You can forget about this in the study notebook.

In [None]:
m                  # Tab

math.              # Tab

math.atan2         # Shift and Tab for docs

math.atan2(1, 2)  # atan2(y, x)

In [None]:
# Return the arctangent of x, in radians.

math.atan(0.5)

In [None]:
# Return atan(y / x), in radians. The result is between -pi and pi.

math.atan2(1, 2)

In [None]:
math.atan(0.25)

In [None]:
math.atan2(1, 4)
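The cells above give the same result for `atan(y / x)` and `atan2(y, x)` because both arguments are positive. A minimal sketch of where the two differ (the sample points are chosen only for illustration): `atan2` looks at the signs of both arguments to pick the quadrant, and it also copes with `x = 0`.

```python
import math

# Same ratio y / x = 1, but the two points lie in opposite quadrants.
angle_q1 = math.atan2(1, 1)       # point (x=1, y=1), first quadrant
angle_q3 = math.atan2(-1, -1)     # point (x=-1, y=-1), third quadrant
angle_atan = math.atan(-1 / -1)   # atan only sees the ratio 1.0

print(math.degrees(angle_q1))     # close to 45 degrees
print(math.degrees(angle_q3))     # close to -135 degrees
print(math.degrees(angle_atan))   # close to 45 degrees again

# atan2 also handles x = 0, where y / x would raise ZeroDivisionError.
angle_up = math.atan2(1, 0)
print(math.degrees(angle_up))     # close to 90 degrees
```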

In [None]:
del math

In [None]:
ma  # Press Tab

# Where is the module 'math'?
# Do you understand it?

## Part II: Numpy and vectorized computing

Almost any machine learning model requires some computational heavy lifting, usually involving linear algebra. Unfortunately, pure Python is slow at this because every operation is interpreted at runtime, one element at a time.

So instead, we'll use `NumPy`, a library for fast computation with vectors, matrices and higher-dimensional arrays (tensors). The central object is `numpy.ndarray`.
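As a rough illustration of what "vectorized" means, the sketch below computes the same squares twice: once with a Python loop and once with a single NumPy expression. The variable names are made up for this example.

```python
import numpy as np

values = list(range(10))

# Pure Python: the interpreter handles the elements one by one.
squares_loop = [v * v for v in values]

# NumPy: one expression, applied to the whole array in compiled code.
arr = np.array(values)
squares_vec = arr * arr

print(squares_loop)
print(squares_vec)
```

Both give the same numbers. The difference is speed, which becomes noticeable on large arrays.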

https://numpy.org

https://numpy.org/devdocs/user/quickstart.html

https://numpy.org/doc/stable/user/basics.creation.html

https://www.w3schools.com/python/numpy_intro.asp

https://github.com/rougier/numpy-tutorial

https://www.datacamp.com/community/tutorials/python-numpy-tutorial

https://www.guru99.com/numpy-tutorial.html

### Creation of arrays and operations with them

In [None]:
import numpy as np

# We can initialize NumPy arrays from Python lists (nested lists give multi-dimensional arrays)
# and access elements using square brackets
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("a = ", a)
print("b = ", b)

# Math and boolean operations can be applied to each element of an array
print("a + 1 = ", a + 1)
print("a * 2 = ", a * 2)
print("a == 2 ", a == 2)

# And corresponding elements of two (or more) arrays
print("a + b = ", a + b)
print("a * b = ", a * b)
print("a / b = ", a / b)

In [None]:
type(a)

https://webcourses.ucf.edu/courses/1249560/pages/python-lists-vs-numpy-arrays-what-is-the-difference

In [None]:
x = np.array([[1, 2.0], [0, 0],
              (1 + 1j, 3.)])  # Note mix of lists and tuple, and mix of types
x

In [None]:
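# Expected to fail: the complex value 1 + 1j cannot be converted to an integer.
# (The np.int alias was removed in NumPy 1.24; use the built-in int instead.)
x = np.array([[1, 2.0], [0, 0], (1 + 1j, 3.)], dtype=int)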

In [None]:
# The np.complex alias was removed in NumPy 1.24; use the built-in complex instead.
b = np.array([[1.5, 2, 3], [4, 5, 6]], dtype=complex)
b

In [None]:
a1 = np.array([1, 2, 3, 4, 5])
b1 = np.array([5, 4, 3, 2])

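# The shapes (5,) and (4,) are incompatible, so the addition raises a ValueError
# and the second line is never reached.
print("a1 + b1 = ", a1 + b1)
print("a1 * b1 = ", a1 * b1)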
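The failing cell above is a good place to look at when NumPy can combine arrays of different shapes. The following is a minimal sketch, not a full statement of the broadcasting rules: an axis of length 1 is stretched to match the other operand, while lengths 5 and 4 cannot be reconciled.

```python
import numpy as np

a1 = np.array([1, 2, 3, 4, 5])   # shape (5,)
b1 = np.array([5, 4, 3, 2])      # shape (4,)

# Lengths 5 and 4 cannot be matched, so this raises ValueError.
try:
    a1 + b1
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print("a1 + b1 failed:", error_message)

# A length-1 array is stretched ("broadcast") to the length of the other operand.
scaled = a1 * np.array([10])
print(scaled)

# A column of shape (4, 1) combined with a row of shape (5,) gives a (4, 5) table.
table = b1[:, np.newaxis] + a1
print(table.shape)
```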

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

c = (a * b) / 2
c

https://cs231n.github.io/python-numpy-tutorial/#numpy

There are several functions to create arrays of zeros, ones, ascending/descending numbers etc.:

In [None]:
np.zeros(shape=(3, 4))

In [None]:
np.ones(shape=(2, 5))

In [None]:
4 * np.ones(shape=(2, 5))

In [None]:
# The np.bool alias was removed in NumPy 1.24; use the built-in bool instead.
np.ones(shape=(2, 5), dtype=bool)

In [None]:
np.full((2, 2), 99)

In [None]:
np.eye(4)

# Return a 2-D array with ones on the diagonal and zeros elsewhere.

In [None]:
np.random.random((2, 2))

### Sorting and combining arrays

In [None]:
array = np.array([7, 5, 3, 2, 6, 1, 4])
array

In [None]:
sorted_array = np.sort(array)
sorted_array

In [None]:
reverse_array = sorted_array[::-1]
reverse_array

In [None]:
np.arange(3, 15, 2)  # start, stop, step

https://numpy.org/doc/stable/reference/generated/numpy.linspace.html

In [None]:
np.linspace(0, 10, 11)  # 11 evenly spaced points on [0, 10], both endpoints included

In [None]:
np.logspace(1, 10, base=2, dtype=np.int64)  # Base 2, number = 50

In [None]:
np.logspace(1, 10, 10, base=2)  # Base 2, number = 10

In [None]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a

In [None]:
a.shape

In [None]:
# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):

b = a[:2, 1:3]
b

In [None]:
b = a[:3, 0:4]
b

In [None]:
b.shape

In [None]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a

In [None]:
# Two ways of accessing the data in the middle row of the array.
# Mixing integer indexing with slices yields an array of lower rank,
# while using only slices yields an array of the same rank as the
# original array:

row_r1 = a[1, :]  # Rank 1 view of the second row of a
row_r1

In [None]:
row_r1.shape

In [None]:
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
row_r2

In [None]:
row_r2.shape

In [None]:
row_r1 = a[1, 1:3]  # Rank 1 view of columns 1 and 2 of the second row of a
row_r1
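The comments above call the slices "views". A short sketch of what that implies, using the same array `a` (the names `row_view` and `row_copy` are introduced only for this example): a basic slice shares memory with the original array, so writing to it changes the original, whereas `.copy()` gives independent data.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

row_view = a[1, :]          # basic slicing: shares memory with `a`
row_copy = a[1, :].copy()   # explicit copy: independent data

row_view[0] = 500           # also changes a[1, 0]
row_copy[1] = -1            # leaves `a` untouched

print(a)
shares_view = np.shares_memory(a, row_view)
shares_copy = np.shares_memory(a, row_copy)
print(shares_view, shares_copy)
```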

In [None]:
a = np.array([[1, 2], [3, 4], [5, 6]])

bool_idx = (a > 2)

bool_idx
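A boolean array like `bool_idx` is mostly useful as a mask. The sketch below shows three common uses on the same array; `selected`, `count` and `clipped` are illustrative names.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
bool_idx = a > 2

# Indexing with the mask returns a 1-D array of the elements where it is True.
selected = a[bool_idx]

# True counts as 1, so summing the mask counts the matching elements.
count = int(bool_idx.sum())

# np.where keeps the shape: take the value from `a` where True, otherwise 0.
clipped = np.where(bool_idx, a, 0)

print(selected)
print(count)
print(clipped)
```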

You can easily reshape arrays:

https://www.w3schools.com/python/numpy_array_reshape.asp

In [None]:
np.arange(24)

In [None]:
np.arange(24).reshape(6, 4)

In [None]:
np.arange(24).reshape(2, 3, 4)

In [None]:
np.arange(24).reshape(4, 3, 2)

or add dimensions of size 1:

https://numpy.org/doc/stable/reference/arrays.indexing.html
    
https://www.edureka.co/community/66684/how-does-numpy-newaxis-work-and-when-to-use-it

https://www.programcreek.com/python/example/12692/numpy.newaxis

In [None]:
print(np.arange(3)[:, np.newaxis])

In [None]:
print(np.arange(3)[np.newaxis, :])

In [None]:
np.arange(3)[:, np.newaxis] + np.arange(3)[np.newaxis, :]

There are also several ways to stack arrays together:

In [None]:
matrix1 = np.arange(50).reshape(10, 5)
matrix1

In [None]:
matrix2 = -np.arange(20).reshape(10, 2)
matrix2

In [None]:
np.concatenate([matrix1, matrix2],
               axis=1)  # Join a sequence of arrays along an existing axis.

# About `axis`: the axis along which the arrays will be joined. If axis is None, arrays are flattened before use.
# The default is 0.

# Axes are defined for arrays with more than one dimension.
# A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0),
# and the second running horizontally across columns (axis 1).

In [None]:
# Many operations can take place along one of these axes.
# For example, we can sum each row of an array, in which case we operate along the columns, or axis 1:

x = np.arange(12).reshape(3, 4)
x

In [None]:
x.sum(axis=1)  # See below about 'sum'.

In [None]:
x.sum(axis=0)

In [None]:
x.sum(axis=-1)

In [None]:
matrix1 = np.arange(50).reshape(10, 5)
matrix2 = -np.arange(20).reshape(10, 2)

np.concatenate([matrix1, matrix2])  # Default is 0.

Any matrix can be transposed easily:

In [None]:
matrix2.T  # The transposed array.

### Problem №0 (all answers are at the end of the notebook):

Write a NumPy program to compute the determinant of a given square array.

You have a = np.array([[1, 0], [1, 2]]).

### Problem №1 (all answers are at the end of the notebook):

You need to replace the middle column in the array with a new column.

You have sampleArray = np.array([[34, 43, 73], [82, 22, 12], [53, 94, 66]]) 
and newColumn = np.array([[10, 10, 10]]).

## Part III: Pandas

Pandas is a library that helps you load data, prepare it and perform some lightweight analysis. The central object is the `pandas.DataFrame`.

https://pandas.pydata.org

https://pandas.pydata.org/docs/

https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

https://www.kaggle.com/learn/pandas # Remember this site (www.kaggle.com).

https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python

### Creating and reading data with pandas

In [None]:
a = np.random.normal(size=100)
a

In [None]:
import pandas as pd

dataframe = pd.DataFrame(a)
dataframe

In [None]:
a = np.random.normal(size=20)
df = pd.DataFrame(a, columns=['Text'], dtype=complex)  # np.complex was removed in NumPy 1.24
df

In [None]:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['A', 'B', 'C'])
df

In [None]:
df.A

In [None]:
type(df)

In [None]:
type(df.A)

In [None]:
a_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
a_array

In [None]:
type(a_array)

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

a_series = pd.Series(a_array, name='a_array', dtype='float32')
a_series

In [None]:
type(a_series)

In [None]:
a_dataframe = pd.DataFrame({'a_series': a_series})
a_dataframe

In [None]:
type(a_dataframe)

In [None]:
# If you don't have the data file yet,
# you can uncomment the two lines below and run them (this requires an Internet connection).
# The file will be downloaded to the current directory,
# and the 'mv' command then moves it to 'data'.
# These lines:

#!wget https://github.com/HSE-LAMBDA/MLatMIPS-2020/raw/master/Introduction/train.csv
#!mv train.csv ./data/.

In [None]:
!ls -lhtr ./data/train.csv

# csv is comma-separated values.

# https://www.howtogeek.com/348960/what-is-a-csv-file-and-how-do-i-open-it/

In [None]:
import pandas as pd
data = pd.read_csv("./data/train.csv")

# This is Titanic dataset.

In [None]:
data

In [None]:
data = pd.read_csv("./data/train.csv", index_col='PassengerId')

In [None]:
data

In [None]:
data.shape

In [None]:
data.shape[1]

In [None]:
data.  # Press Tab

### Referring to data items

In [None]:
head = data[:10]
head

In [None]:
type(head)

In [None]:
# Return the first n rows.

data.head()

In [None]:
data.head(10)

In [None]:
# Return the last n rows.

data.tail()

In [None]:
data.tail(10)

In [None]:
# Return a random sample of items from an axis of the object.

data.sample()

In [None]:
data.sample(5)

#### About the data
Here are some of the columns
* Name - a string with a person's full name
* Survived - 1 if a person survived the shipwreck, 0 otherwise.
* Pclass - passenger class. Pclass == 3 is cheap'n'cheerful, Pclass == 1 is for moneybags.
* Sex - a person's gender
* Age - age in years, if available
* SibSp - number of siblings and spouses aboard
* Parch - number of parents and children aboard
* Fare - ticket cost
* Embarked - the port where the passenger embarked
 * C = Cherbourg; Q = Queenstown; S = Southampton

In [None]:
# Table dimensions

print("len(data) = ", len(data))
print("data.shape = ", data.shape)

In [None]:
# Select a single row

print(data.loc[4])  # See below about 'loc'

In [None]:
a = data.loc[4]
type(a)

In [None]:
# Select a single column.

ages = data["Age"]
print(ages[:10])  # Alternatively: data.Age

In [None]:
data.Age[:10]

In [None]:
# Select several columns and rows at once

data.loc[5:10, ["Fare",
                "Pclass"]]  # Alternatively: data[["Fare", "Pclass"]].loc[5:10]

In [None]:
data[["Fare", "Pclass"]].loc[5:10]

In [None]:
data[data["Sex"] == 'female']

In [None]:
data[data["Pclass"] == 3]

In [None]:
data[data["Pclass"] < 3]

#### `loc` vs `iloc`

There are two ways of indexing the rows in pandas:
 *   by index column values (`PassengerId` in our case) – use `data.loc` for that
 *   by positional index - use `data.iloc` for that

Note that index column starts from 1, so positional index 0 will correspond to index column value 1, positional 1 to index column value 2, and so on:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
    
https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/
    
https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different

In [None]:
data.index

In [None]:
print(data.iloc[0])

In [None]:
print(data.loc[1])
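To see the difference without the Titanic file, here is a small sketch on a made-up table whose index also starts at 1 (the names and ages are invented). It also shows one detail that is easy to miss: `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid"], "Age": [30, 41, 25]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

first_by_position = toy.iloc[0]   # position 0
first_by_label = toy.loc[1]       # label 1, which is the same row

by_position = toy.iloc[0:2]       # positions 0 and 1 (end excluded)
by_label = toy.loc[1:2]           # labels 1 and 2 (end included)

# There is no row labelled 0, so loc raises KeyError here.
try:
    toy.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(first_by_position)
print(by_label)
print(label_zero_exists)
```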

Pandas also has some basic data analysis tools. For one, you can quickly display statistical aggregates for each column using `.describe()`:

In [None]:
data.describe()  # Generate descriptive statistics.

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [None]:
data["Sex"].describe()

### Operations with DataFrame data

In [None]:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df

In [None]:
df.apply(np.sqrt)  # Apply a function along an axis of the DataFrame.

# See below about connection NumPy and pandas

In [None]:
df.apply(np.sum, axis=0)

In [None]:
df.apply(np.sum, axis=1)

In [None]:
df.apply(lambda x: x, axis=1)

In [None]:
df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 0, -2),
    'C': range(10, 5, -1)
})

df

In [None]:
df.query(
    'A > B')  # Query the columns of a DataFrame with a boolean expression.

In [None]:
# The previous expression is equivalent to

df[df.A > df.B]

In [None]:
df.query('B == C')

In [None]:
# The previous expression is equivalent to

df[df.B == df['C']]

In [None]:
df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
df

In [None]:
pd.eval("double_age = df.age * 2", target=df)

# Evaluate a Python expression as a string using various backends.

In [None]:
df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
df

In [None]:
df.eval('A + B')

In [None]:
df.eval('C = A + B')

In [None]:
df

In [None]:
df.eval('C = A + B', inplace=True)
df

In [None]:
df

In [None]:
df.eval('''
 C = A + B
 D = A - B
 ''')

In [None]:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df

In [None]:
df.loc[0, 'A'] = 1
df

In [None]:
df.loc['QWERTY', 'A'] = 1
df

In [None]:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df

In [None]:
df.loc[0, 'A'] = 1
df

In [None]:
df.loc[1] = [2, 3, 4]
df

In [None]:
df.loc[2] = {'A': 3, 'C': 9, 'B': 9}
df

In [None]:
df.loc[1] = [5, 6, 7]
df

In [None]:
df.loc[0, 'B'] = 8
df

In [None]:
df_empty = pd.DataFrame(columns=['A', 'B', 'C'])
df_empty

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
dff = pd.concat([df_empty, df])
dff.loc[0, 'C'] = 99
dff

In [None]:
df

In [None]:
dff.loc[3] = {'A': 3, 'C': 9, 'B': 9}
dff

In [None]:
dff_second = pd.concat([dff, df], ignore_index=True)  # Replaces dff.append(df, ignore_index=True)
dff_second

In [None]:
dff_third = pd.concat([dff, df])  # Replaces dff.append(df); index labels are kept, so they repeat
dff_third

In [None]:
one = pd.DataFrame(
    {
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
        'Marks_scored': [98, 90, 87, 69, 78]
    },
    index=[1, 2, 3, 4, 5])

two = pd.DataFrame(
    {
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
        'Marks_scored': [89, 80, 79, 97, 88]
    },
    index=[1, 2, 3, 4, 5])

In [None]:
one

In [None]:
two

In [None]:
pd.concat([one, two])

In [None]:
pd.concat([one, two], keys=['x', 'y'])

In [None]:
a = pd.concat([one, two], ignore_index=True)
a

https://stackoverflow.com/questions/12555323/adding-new-column-to-existing-dataframe-in-python-pandas

https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/

https://www.interviewqs.com/ddi_code_snippets/add_new_col_df_default_value

https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe

https://thispointer.com/python-pandas-how-to-add-rows-in-a-dataframe-using-dataframe-append-loc-iloc/

https://pythonexamples.org/pandas-dataframe-add-append-row/

https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/

https://www.geeksforgeeks.org/how-to-drop-one-or-multiple-columns-in-pandas-dataframe/

https://www.w3resource.com/pandas/dataframe/dataframe-drop.php

https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/

https://stackoverflow.com/questions/18172851/deleting-dataframe-row-in-pandas-based-on-column-value

https://note.nkmk.me/en/python-pandas-drop/

Russian: https://www.rupython.com/pandas-dataframe-python-del-937.html

In [None]:
del a['Name']
a

In [None]:
a = pd.concat([one, two], ignore_index=True)
a

In [None]:
a.drop(['Name'], axis='columns', inplace=True)
a

In [None]:
a.drop([3, 5], axis='rows', inplace=True)
a

In [None]:
df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=list('ab'))
df

In [None]:
df = df.drop([0, 4])
df

In [None]:
# But let's go back now to our dataset.

### Operations with NaN values

Some columns contain __NaN__ values, which means that the data is missing there. For example, the passenger at position 5 (`data.iloc[5]`) has an unknown age. To simplify the analysis that follows, we'll replace the NaN values using the pandas `fillna` function.

Note: we do this so easily because it's a tutorial. In general, think twice before you modify data like this.

In [None]:
data.iloc[5]

In [None]:
data.loc[889]

In [None]:
# Fill NA/NaN values using the specified method.

data['Age'] = data['Age'].fillna(value=data['Age'].mean())
data['Fare'] = data['Fare'].fillna(value=data['Fare'].mean())

In [None]:
data.iloc[5]

In [None]:
data.loc[889]

In [None]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5], [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

df

In [None]:
df.fillna(0)

In [None]:
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)

In [None]:
df = pd.DataFrame({
    "name": ['Alfred', 'Batman', 'Catwoman'],
    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]
})

df

In [None]:
df.dropna()  # Remove missing values.

In [None]:
df.dropna(axis='columns')

In [None]:
df.dropna(axis='rows')

In [None]:
# Drop the rows where all elements are missing.

df.dropna(how='all')

In [None]:
# Keep only the rows with at least 2 non-NA values.

df.dropna(thresh=2)

In [None]:
# Define in which columns to look for missing values.

df.dropna(subset=['name', 'born'])

https://pandas.pydata.org/pandas-docs/version/1.0.0/user_guide/missing_data.html

https://chartio.com/resources/tutorials/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe/

https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/

https://stackoverflow.com/questions/46305837/pandas-selecting-nan-values-using-np-nan

Russian: https://pythonru.com/biblioteki/not-a-number-vse-o-nan-pd-5

In [None]:
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df

In [None]:
df['four'] = 'bar'
df['five'] = df['one'] > 0
df

In [None]:
# Conform Series/DataFrame to a new index with optional filling logic.

# Places NaN in locations having no value in the previous index.

df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2

In [None]:
df2['one']

In [None]:
pd.isna(df2['one'])  # Detect missing values.

In [None]:
df2['four'].notna()  # Detect existing (non-missing) values.

In [None]:
df2.isna()

In [None]:
# One has to be mindful that in Python (and NumPy), the nan's don’t compare equal, but None's do.
# Note that pandas/NumPy uses the fact that np.nan != np.nan, and treats None like np.nan.

In [None]:
None == None

In [None]:
np.nan == np.nan

In [None]:
# So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information.

df2['one'] == np.nan
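A small sketch of the practical consequence, on a made-up Series: comparing with `np.nan` never finds anything, so missing values are detected with `isna()` instead.

```python
import numpy as np
import pandas as pd

# None is converted to NaN when the Series is built from floats.
s = pd.Series([1.5, np.nan, 3.0, None], name="one")

equal_to_nan = s == np.nan    # False everywhere, even for the missing entries
missing = s.isna()            # True exactly where the value is missing
n_missing = int(missing.sum())

print(equal_to_nan.tolist())
print(missing.tolist())
print(n_missing)
```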

* A bunch of cheat sheets is just one Google query away (e.g. [basics](http://blog.yhat.com/static/img/datacamp-cheat.png), [combining datasets](https://pbs.twimg.com/media/C65MaMpVwAA3v0A.jpg) and so on).

### Problem №2 (all answers are at the end of the notebook):

Write a Pandas program to count the number of rows and columns of a DataFrame.

You have sample DataFrame:
exam_data = {
    'name': [
        'Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
        'Matthew', 'Laura', 'Kevin', 'Jonas'
    ],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify':
    ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']
}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'].

### Problem №3 (all answers are at the end of the notebook):

Write a Pandas program to calculate the mean score for each different student in a data frame.

You have sample DataFrame:
exam_data = {
    'name': [
        'Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
        'Matthew', 'Laura', 'Kevin', 'Jonas'
    ],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify':
    ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']
}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'].

## Part IV: Other NumPy functions and features, and the connection between NumPy and pandas

There's also a bunch of pre-implemented operations including logarithms, trigonometry, vector/matrix products and aggregations.

https://blog.thedataincubator.com/2018/02/numpy-and-pandas/

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("numpy.sum(a) = ", np.sum(a))
print("numpy.mean(a) = ", np.mean(a))
print("numpy.min(a) = ", np.min(a))
print("numpy.argmin(b) = ", np.argmin(b))  # Index of the minimal element
print("numpy.dot(a, b) = ",
      np.dot(a, b))  # Dot product. Also used for matrix/tensor multiplication

print("np.unique = ",
      np.unique(['male', 'male', 'female', 'female', 'male',
                 'qqq']))  # Find the unique elements of an array.

# and tons of other stuff. See http://bit.ly/2u5q430 .

In [None]:
c = np.unique(['male', 'male', 'female', 'female', 'male', 'qqq'])
c

In [None]:
c = np.unique(['2', '3', '4', '5', '6', '1'])
c

The important part: all this functionality works with data frames, as you can get their NumPy representation with `.values` (most NumPy functions will even work on pure pandas objects):

In [None]:
# Calling np.max on a pure pandas column:
print("Max ticket price: ", np.max(data["Fare"]))

# Calling np.argmax on a NumPy representation of a pandas column:
print("\nThe guy who paid the most:\n",
      data.iloc[np.argmax(data["Fare"].values)])

In [None]:
print("\nThe guy who paid the most:\n", data.iloc[np.argmax(data["Fare"])])

In [None]:
c = data.iloc[np.argmax(data["Fare"].values)]
c

In [None]:
type(c)

In [None]:
c1 = data.iloc[np.argmax(data["Fare"])]
c1

In [None]:
np.max(data["Fare"])

In [None]:
np.min(data["Fare"])

In [None]:
np.std(
    data["Fare"])  # Compute the standard deviation along the specified axis.

In [None]:
type(c1)
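`np.argmax` returns a position, which is why the cells above pair it with `iloc`. As a hedged alternative, pandas has `idxmax`, which returns the index label and therefore pairs with `loc`. The sketch uses a made-up table rather than the Titanic data.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid"], "Fare": [7.25, 512.33, 71.28]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

position = int(toy["Fare"].to_numpy().argmax())  # position of the maximum
label = toy["Fare"].idxmax()                     # index label of the maximum

by_position = toy.iloc[position]
by_label = toy.loc[label]

print(position, label)
print(by_label)
```

Here the position and the label differ by one because the index starts at 1, which is also the case for `PassengerId`.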

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

print("Boolean operations")

print('a = ', a)
print('b = ', b)
print("a > 2", a > 2)
print("numpy.logical_not(a > 2) = ", np.logical_not(a > 2))
print("numpy.logical_and(a > 2, b > 2) = ", np.logical_and(a > 2, b > 2))
print("numpy.logical_or(a > 2 , b < 3) = ", np.logical_or(a > 2, b < 3))

print("\n shortcuts")
print("~(a > 2) = ", ~(a > 2))  # Logical_not(a > 2)
print("(a > 2) & (b > 2) = ", (a > 2) & (b > 2))  # Logical_and
print("(a > 2) | (b < 3) = ", (a > 2) | (b < 3))  # Logical_or

Another NumPy feature we'll need is indexing: selecting elements from an array. 
Aside from python indexes and slices (e.g. a[1:4]), NumPy also allows you to select several elements at once.

In [None]:
a = np.array([0, 1, 4, 9, 16, 25])
ix = np.array([1, 2, 5])

print("a = ", a)
print("Select by element index:")
print("a[[1, 2, 5]] = ", a[ix])

print("\nSelect by boolean mask:")
print("a[a > 5] = ",
      a[a > 5])  # Select all elements in 'a' that are greater than 5
print("(a % 2 == 0) =", a % 2 == 0)  # True for even, False for odd
print("a[a % 2 == 0] =",
      a[a % 2 == 0])  # Select all elements in 'a' that are even

In [None]:
data[(data['Age'] < 18) & (data['Sex'] == 'male')]

In [None]:
data[(data['Sex'] == 'female') & (data['Sex'] == 'male')]

In [None]:
# Who on average paid more for their ticket, men or women?

mean_fare_men = np.mean(data[data["Sex"] == "male"]["Fare"])
mean_fare_women = np.mean(data[data["Sex"] == "female"]["Fare"])

print(mean_fare_men, mean_fare_women)

In [None]:
mean_fare = np.mean(data["Fare"])
mean_fare

In [None]:
# Who is more likely to survive: a child (< 18 y.o.) or an adult?

child_survival_rate = np.sum(data[data["Age"] < 18]["Survived"]) / np.shape(
    data[data["Age"] < 18])[0]
adult_survival_rate = np.sum(data[data["Age"] >= 18]["Survived"]) / np.shape(
    data[data["Age"] >= 18])[0]

print(child_survival_rate, adult_survival_rate)

In [None]:
data[data["Age"] < 18]["Survived"]

In [None]:
np.shape(data[data["Age"] < 18])

In [None]:
np.shape(data[data["Age"] < 18])[0]
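Since `Survived` holds only 0 and 1, the survival rate is simply the mean of that column, so the sum-divided-by-count pattern above can be shortened. The sketch below shows this on a small invented table; the `group` labels are an assumption made for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [4, 16, 30, 45, 60],
    "Survived": [1, 0, 1, 0, 0],
})

is_child = toy["Age"] < 18

# The mean of a 0/1 column is the fraction of ones.
child_rate = toy.loc[is_child, "Survived"].mean()
adult_rate = toy.loc[~is_child, "Survived"].mean()

# The same numbers in one step, grouping by a label array.
group = np.where(is_child, "child", "adult")
rates = toy.groupby(group)["Survived"].mean()

print(child_rate, adult_rate)
print(rates)
```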

In [None]:
# It may be interesting to you: Python vs NumPy.

big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

In [None]:
# NumPy executes the operation in compiled code,
# so its version of the operation is computed much more quickly.

# Be careful, though: the sum function and the np.sum function are not identical,
# which can sometimes lead to confusion! In particular, their optional arguments have different meanings,
# and np.sum is aware of multiple array dimensions, as we will see in the following section.
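A minimal sketch of the difference mentioned above, on a small 2-D array: the second positional argument of the built-in `sum` is a start value, while for `np.sum` it is the axis.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

# Built-in sum iterates over the first axis, so it adds the two rows together.
builtin_result = sum(m)           # array([5, 7, 9])
# Its second positional argument is a start value added to the result.
builtin_with_start = sum(m, 10)   # array([15, 17, 19])

# np.sum adds every element by default; its second argument is the axis.
numpy_total = np.sum(m)           # 21
numpy_axis1 = np.sum(m, 1)        # array([ 6, 15])

print(builtin_result, builtin_with_start)
print(numpy_total, numpy_axis1)
```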

https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html

In [None]:
from math import pi
print('pi = ', pi)

In [None]:
# The first array runs from 0 to 999 999
# The second array runs from 99 to 1 000 098

In [None]:
%%time
# ^-- this "magic" measures and prints cell computation time

# Option I: pure Python
arr_1 = range(1000000)
arr_2 = range(99, 1000099)

a_sum = []
a_prod = []
sqrt_a1 = []
sqrt_a2 = []
other_a = []
for i in range(len(arr_1)):
    a_sum.append(arr_1[i] + arr_2[i])
    a_prod.append(arr_1[i] * arr_2[i])
    sqrt_a1.append(arr_1[i]**0.5)
    sqrt_a2.append(arr_2[i]**0.5)
    other_a.append(((arr_1[i]**1.5) + (arr_2[i]**1.5)) * pi)

arr_1_sum = sum(arr_1)
arr_2_sum = sum(arr_2)

In [None]:
%%time

# Option II: start from Python, convert to NumPy
arr_1 = range(1000000)
arr_2 = range(99, 1000099)

arr_1, arr_2 = np.array(arr_1), np.array(arr_2)

a_sum = arr_1 + arr_2
a_prod = arr_1 * arr_2
sqrt_a1 = arr_1**.5
sqrt_a2 = arr_2**.5
other_a = (((arr_1**1.5) + (arr_2**1.5)) * pi)

arr_1_sum = arr_1.sum()
arr_2_sum = arr_2.sum()

In [None]:
%%time

# Option III: pure NumPy
arr_1 = np.arange(1000000)
arr_2 = np.arange(99, 1000099)

a_sum = arr_1 + arr_2
a_prod = arr_1 * arr_2
sqrt_a1 = arr_1**.5
sqrt_a2 = arr_2**.5
other_a = (((arr_1**1.5) + (arr_2**1.5)) * pi)

arr_1_sum = arr_1.sum()
arr_2_sum = arr_2.sum()

### Benchmarks of speed (NumPy vs all): http://brilliantlywrong.blogspot.com/2015/01/benchmarks-of-speed-numpy-vs-all.html

### <center><b> Some numbers from the link: </b></center>

<h1>
<table>
  <tr>
    <td>Pure Python</td>
    <td>183ms</td>
  </tr>
  <tr>
    <td>Numpy</td>
    <td>5.97ms</td>
  </tr>
  <tr>
    <td>Naive Cython</td>
    <td>7.76ms</td>
  </tr>
  <tr>
    <td>Optimised Cython</td>
    <td>2.18ms</td>
  </tr>
  <tr>
    <td>Cython calling C</td>
    <td>2.22ms</td>
  </tr>
</table>
</h1>

<h1>
<table>
  <tr>
    <td>Python</td>
    <td>9.51s</td>
  </tr>
  <tr>
    <td>Naive numpy</td>
    <td>64.7ms</td>
  </tr>
  <tr>
    <td>Numba</td>
    <td>6.72ms</td>
  </tr>
  <tr>
    <td>Cython</td>
    <td>6.57ms</td>
  </tr>
  <tr>
    <td>Parakeet</td>
    <td>12.3ms</td>
  </tr>
</table>
</h1>

In [None]:
# But let's go back now to our dataset.

In [None]:
data.Age.plot()

In [None]:
data.Age.plot(kind='hist')

In [None]:
data.Pclass.plot(kind='hist')

# Coffee Break?

## Part V: Matplotlib, plotly and seaborn

### Matplotlib

Several Python libraries cover data visualization: `matplotlib`, `plotly`, etc.

Just like python itself, matplotlib has an awesome tendency of keeping simple things simple while still allowing you to write complicated stuff with convenience (e.g. super-detailed plots or custom animations).

https://matplotlib.org/3.3.1/tutorials/index.html

https://github.com/rougier/matplotlib-tutorial

https://www.tutorialspoint.com/matplotlib/index.htm

https://realpython.com/python-matplotlib-guide/

https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# ^-- This "magic" selects specific matplotlib backend suitable for
# jupyter notebooks. For more info see:
# https://ipython.readthedocs.io/en/stable/interactive/plotting.html#id1

# line plot
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])

In [None]:
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25], [0, 1, 3, 5, 10, 30])

In [None]:
# scatter-plot

import numpy as np

x = np.arange(5)
print("x = ", x)
print("x**2 = ", x**2)
print("plotting x ** 2 vs x:")
plt.scatter(
    x,
    x**2)  # A scatter plot of y vs. x with varying marker size and/or colour.

plt.show()  # Show the first plot and begin drawing next one
plt.plot(x, x**2)

In [None]:
# scatter-plot

x = np.arange(5)
print("x = ", x)
print("x**2 = ", x**2)
print("plotting x ** 2 vs x:")
plt.scatter(x, x**2)

#plt.show()  # Show the first plot and begin drawing next one
plt.plot(x, x**2)

In [None]:
# Draw a scatter plot with custom markers and colours
plt.scatter([1, 1, 2, 3, 4, 4.5], [3, 2, 2, 5, 15, 24],
            c=["red", "blue", "orange", "green", "cyan", "gray"],
            marker=",")

plt.scatter([0.0, 0.1, 0.2], [2, 4, 16],
            c=["red", "blue", "orange"],
            marker="o")

# Without plt.show(), several plots will be drawn on top of one another
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25], c="blue")

plt.title("Title")
plt.xlabel("Text 1")
plt.ylabel("Text 2")

In [None]:
# Histogram - showing data density
plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9,
          10])  # Plot a histogram.
plt.show()

plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10],
         bins=5)

In [None]:
plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10],
         bins=30)
# Plot a histogram.

In [None]:
import pandas as pd

data = pd.read_csv("./data/train.csv", index_col='PassengerId')

data["Age"].plot(kind='hist')

In [None]:
data["Fare"].plot(kind='hist')

In [None]:
data["Age"].plot(kind='hist')
data["Fare"].plot(kind='hist')

# Do you know where 'Age' is?

In [None]:
data["Fare"].plot(kind='hist')
data["Age"].plot(kind='hist')

In [None]:
data["Fare"].plot(kind='hist')
plt.show()
data["Age"].plot(kind='hist')

In [None]:
# Plot a histogram of age and a histogram of ticket fares on separate plots

plt.subplot(211)  # nrows, ncols, index, ...
plt.hist(data["Age"])
plt.subplot(212)  # Add a subplot to the current figure.
plt.hist(data["Fare"])

In [None]:
# Plot a histogram of age and a histogram of ticket fares on separate plots

plt.subplot(221)  # nrows, ncols, index, ...
plt.hist(data["Age"])
plt.subplot(222)
plt.hist(data["Fare"])
plt.subplot(223)
plt.hist(data["Fare"])
plt.subplot(224)
plt.hist(data["Age"])
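The `plt.subplot(211)` calls above rely on matplotlib's notion of a "current" axes. A commonly used alternative is `plt.subplots`, which returns the figure and the axes objects explicitly. The sketch below uses randomly generated stand-ins for the `Age` and `Fare` columns, so the numbers are not from the Titanic data.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(30, 12, size=200)      # made-up stand-in for data["Age"]
fares = rng.exponential(30, size=200)    # made-up stand-in for data["Fare"]

# One figure with two rows and one column of axes.
fig, (ax_age, ax_fare) = plt.subplots(nrows=2, ncols=1, figsize=(6, 6))

ax_age.hist(ages, bins=20)
ax_age.set_title("Age")

ax_fare.hist(fares, bins=20)
ax_fare.set_title("Fare")

fig.tight_layout()  # avoid overlapping titles and tick labels
```

Because each call names the axes it draws on, the order of the plotting calls no longer matters.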

In [None]:
# Make a scatter plot of passenger age vs ticket fare

m_data = data[data["Sex"] == "male"]
f_data = data[data["Sex"] == "female"]
plt.scatter(m_data["Age"], m_data["Fare"], c='r')
plt.scatter(f_data["Age"], f_data["Fare"], c='g')

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Make data
u = np.linspace(0, 2 * np.pi, 100)
v = np.linspace(0, np.pi, 100)
x = 10 * np.outer(np.cos(u),
                  np.sin(v))  # Compute the outer product of two vectors
y = 10 * np.outer(np.sin(u), np.sin(v))
z = 10 * np.outer(np.ones(np.size(u)), np.cos(v))

# Plot the surface
ax.plot_surface(x, y, z)

plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Make data
u = np.linspace(0, 2 * np.pi, 100)
v = np.linspace(0, np.pi, 100)
x = 10 * np.outer(np.cos(u),
                  np.sin(v))  # Compute the outer product of two vectors
y = 10 * np.outer(np.sin(u), np.sin(v))
z = 10 * np.outer(np.zeros(np.size(u)), np.cos(v))

# Plot the surface
ax.plot_surface(x, y, z)

plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Make data
u = np.linspace(0, 2 * np.pi, 10)
v = np.linspace(0, np.pi, 10)
x = 10 * np.outer(np.cos(u), np.sin(v))
y = 10 * np.outer(np.sin(u), np.sin(v))
z = 10 * np.outer(np.ones(np.size(u)), np.cos(v))

# Plot the surface
ax.plot_surface(x, y, z)

plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Make data
u = np.linspace(0, 2 * np.pi, 5)
v = np.linspace(0, np.pi, 5)
x = 10 * np.outer(np.cos(u), np.sin(v))
y = 10 * np.outer(np.sin(u), np.sin(v))
z = 10 * np.outer(np.ones(np.size(u)), np.cos(v))

# Plot the surface
ax.plot_surface(x, y, z)

plt.show()

In [None]:
def lorenz(x, y, z, s=10, r=28, b=2.667):
    """
    Given:
       x, y, z: a point of interest in three-dimensional space
       s, r, b: parameters defining the Lorenz attractor
    Returns:
       x_dot, y_dot, z_dot: values of the Lorenz attractor's partial
           derivatives at the point x, y, z
    """
    x_dot = s * (y - x)
    y_dot = r * x - y - x * z
    z_dot = x * y - b * z
    return x_dot, y_dot, z_dot


dt = 0.01
num_steps = 10000

# Need one more for the initial values
# np.empty returns a new array of given shape and type, without initializing entries.
xs = np.empty(num_steps + 1)
ys = np.empty(num_steps + 1)
zs = np.empty(num_steps + 1)

# Set initial values
xs[0], ys[0], zs[0] = (0., 1., 1.05)

# Step through "time", calculating the partial derivatives at the current point
# and using them to estimate the next point
for i in range(num_steps):
    x_dot, y_dot, z_dot = lorenz(xs[i], ys[i], zs[i])
    xs[i + 1] = xs[i] + (x_dot * dt)
    ys[i + 1] = ys[i] + (y_dot * dt)
    zs[i + 1] = zs[i] + (z_dot * dt)

# Plot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Add 3D axes to the figure.

ax.plot(xs, ys, zs, lw=0.5)
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")
ax.set_zlabel("Z Axis")
ax.set_title("Lorenz Attractor")

plt.show()
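
The loop above is a hand-written forward Euler integration. As a rough sketch of an alternative, the same system can be handed to `scipy.integrate.solve_ivp`, which chooses its own internal step size. The function name `lorenz_rhs` is made up for this illustration; the parameters and initial values are copied from the cell above.

```python
import numpy as np
from scipy.integrate import solve_ivp


def lorenz_rhs(t, state, s=10, r=28, b=2.667):
    """Right-hand side of the Lorenz system in the form solve_ivp expects."""
    x, y, z = state
    return [s * (y - x), r * x - y - x * z, x * y - b * z]


dt = 0.01
num_steps = 10000
t_end = dt * num_steps

# Ask for the solution at the same time points as the Euler loop.
t_eval = np.linspace(0, t_end, num_steps + 1)

sol = solve_ivp(lorenz_rhs, (0, t_end), [0., 1., 1.05], t_eval=t_eval)
xs_ivp, ys_ivp, zs_ivp = sol.y  # Each row of sol.y is one coordinate over time.

print(sol.success, xs_ivp.shape)
```

Because the Lorenz system is chaotic, the two trajectories should not be expected to match point by point after a while, but the overall butterfly shape is the same.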

### Problem №4 (all answers are at the end of the notebook):

Read the toothpaste sales data for each month and show it using a scatter plot.

Also, add a grid to the plot. The gridline style should be dashed (`--`).
The scatter plot should look like this.

<center>
<img src="https://pynative.com/wp-content/uploads/2019/01/matplotlib_and_pandas_exercise_4_show_scatter_plot.png" width="500" height="600">
</center>

You have this file: company_sales_data.csv.

In [None]:
# If the picture does not work for you (there is no Internet connection),
# you can uncomment these two lines below and then run the cell. The picture should appear.
# These two:

#from IPython.display import Image
#Image("./images/matplotlib_and_pandas_exercise_4_show_scatter_plot.png")

In [None]:
# If you don't have the file yet (this step needs an Internet connection),
# you can uncomment the two lines below and then run them.
# The file will be downloaded to the current directory,
# and the 'mv' command then moves it to 'data'.
# These lines:

#!wget https://pynative.com/wp-content/uploads/2019/01/company_sales_data.csv
#!mv company_sales_data.csv ./data/.

In [None]:
!ls -lhtr ./data/company_sales_data.csv

### Plotly

https://plotly.com/python/

https://www.kaggle.com/kanncaa1/plotly-tutorial-for-beginners

https://www.tutorialspoint.com/plotly/index.htm

In [None]:
import plotly.express as px
df = px.data.tips()  # Pre-prepared dataset.

fig = px.density_heatmap(df, x="total_bill", y="tip")
fig.show()

# In a density heatmap, rows of data_frame are grouped into coloured rectangular tiles
# to visualize the 2D distribution of an aggregate function histfunc (e.g. the count or sum) of the value z.
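
To see what "grouping rows into rectangular tiles" means numerically, here is a minimal sketch using `np.histogram2d`. It only illustrates the counting idea and is not a claim about how Plotly computes the heatmap internally. The two arrays are synthetic stand-ins (an assumption) for the `total_bill` and `tip` columns.

```python
import numpy as np

# Synthetic stand-ins for the "total_bill" and "tip" columns (assumption).
rng = np.random.default_rng(0)
total_bill = rng.uniform(3, 50, size=244)
tip = 0.15 * total_bill + rng.normal(0, 1, size=244)

# Count how many rows fall into each rectangular (x, y) tile.
counts, x_edges, y_edges = np.histogram2d(total_bill, tip, bins=(20, 20))

print(counts.shape)       # One entry per tile.
print(int(counts.sum()))  # Every row lands in exactly one tile.
```

The colour of each tile in a density heatmap corresponds to one entry of a matrix like `counts`.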

In [None]:
df

In [None]:
fig = px.density_heatmap(df,
                         x="total_bill",
                         y="tip",
                         nbinsx=20,
                         nbinsy=20,
                         color_continuous_scale="Viridis")

fig.show()

In [None]:
fig = px.density_heatmap(df,
                         x="total_bill",
                         y="tip",
                         marginal_x="histogram",
                         marginal_y="histogram")
fig.show()

In [None]:
fig = px.density_heatmap(df,
                         x="total_bill",
                         y="tip",
                         facet_row="sex",
                         facet_col="smoker")
fig.show()

In [None]:
fig = px.density_heatmap(df,
                         x="total_bill",
                         y="tip",
                         facet_row="day",
                         facet_col="time")
fig.show()

In [None]:
df = px.data.wind()

fig = px.scatter_polar(df, r="frequency", theta="direction")
fig.show()

In [None]:
df

In [None]:
fig = px.line_polar(df,
                    r="frequency",
                    theta="direction",
                    color="strength",
                    line_close=True,
                    color_discrete_sequence=px.colors.sequential.Plasma_r,
                    template="plotly_dark")

fig.show()

# In a polar line plot, each row of data_frame is represented as a vertex of a polyline mark in polar coordinates.

In [None]:
import plotly.figure_factory as ff

# np.meshgrid returns coordinate matrices from coordinate vectors.
x, y = np.meshgrid(np.arange(0, 2, .2), np.arange(0, 2, .2))
u = np.cos(x) * y
v = np.sin(x) * y

fig = ff.create_quiver(x, y, u, v)  # Returns data for a quiver plot.
fig.show()

### Seaborn

https://seaborn.pydata.org
    
https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html

https://python-graph-gallery.com/seaborn/

https://elitedatascience.com/python-seaborn-tutorial

https://www.tutorialspoint.com/seaborn/index.htm

https://www.kaggle.com/kanncaa1/seaborn-tutorial-for-beginners

https://www.datacamp.com/community/tutorials/seaborn-python-tutorial

In [None]:
import seaborn as sns  # or sn

# Load the example flights dataset (long-form) and pivot it to wide-form.
flights_long = sns.load_dataset("flights")
flights = flights_long.pivot(index="month", columns="year", values="passengers")

# Draw a heatmap with the numeric values in each cell.
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5,
            ax=ax)  # Plot rectangular data as a colour-encoded matrix.

In [None]:
flights_long

In [None]:
flights
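
The difference between `flights_long` and `flights` is the long-form versus wide-form layout. A tiny hand-made table (an illustration, not the full dataset) shows what `pivot` does:

```python
import pandas as pd

# Long-form: one row per (year, month) observation.
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide-form: months become the index, years become the columns.
wide_df = long_df.pivot(index="month", columns="year", values="passengers")

print(long_df)
print(wide_df)
```

`sns.heatmap` expects the wide-form layout, because each cell of the matrix becomes one coloured tile.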

In [None]:
ax = sns.heatmap(flights, cmap="YlGnBu")

In [None]:
sns.set(style="ticks")  # Set aesthetic parameters in one step.
tips = sns.load_dataset("tips")
g = sns.relplot(x="total_bill", y="tip", hue="day", data=tips)

# Figure-level interface for drawing relational plots onto a FacetGrid.

In [None]:
g = sns.relplot(x="total_bill", y="tip", hue="day", col="time", data=tips)

In [None]:
g = sns.relplot(x="total_bill",
                y="tip",
                hue="day",
                col="time",
                row="sex",
                data=tips)

In [None]:
g = sns.relplot(x="total_bill",
                y="tip",
                hue="time",
                size="size",
                palette=["b", "r"],
                sizes=(10, 100),
                col="time",
                data=tips)

In [None]:
fmri = sns.load_dataset("fmri")
g = sns.relplot(x="timepoint",
                y="signal",
                hue="event",
                style="event",
                col="region",
                kind="line",
                data=fmri)

In [None]:
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
# Draw a box plot to show distributions with respect to categories.
ax = sns.boxplot(x=tips["total_bill"])

In [None]:
ax = sns.boxplot(x="day", y="total_bill", data=tips)

In [None]:
ax = sns.boxplot(x="day",
                 y="total_bill",
                 hue="smoker",
                 data=tips,
                 palette="Set3")

In [None]:
ax = sns.boxplot(x="day", y="total_bill", data=tips)
ax = sns.swarmplot(
    x="day", y="total_bill", data=tips,
    color=".25")  # Draw a categorical scatterplot with non-overlapping points.

In [None]:
ax = sns.violinplot(x="day",
                    y="total_bill",
                    hue="smoker",
                    data=tips,
                    palette="muted",
                    split=True)

[Kernel Density Estimation](https://mathisonian.github.io/kde/)

http://faculty.washington.edu/yenchic/18W_425/Lec6_hist_KDE.pdf

https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html

http://www.machinelearning.ru/wiki/images/5/51/Kitov-ML-eng-11-Kernel_density_estimation.pdf

https://deepai.org/machine-learning-glossary-and-terms/kernel-density-estimation
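
The outline of a violin plot is a kernel density estimate. As a minimal sketch of the idea outside seaborn, `scipy.stats.gaussian_kde` can estimate a smooth density from a sample. The sample below is synthetic (an assumption), chosen only so that the cell runs on its own.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=20, scale=8, size=300)  # Synthetic data (assumption).

kde = gaussian_kde(sample)  # Bandwidth is chosen by Scott's rule by default.

# Evaluate the estimated density on a grid slightly wider than the data.
grid = np.linspace(sample.min() - 10, sample.max() + 10, 400)
density = kde(grid)

# A density should integrate to roughly 1 over a wide enough grid.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

Plotting `grid` against `density` with `plt.plot` gives the smooth curve that a histogram of the same sample only approximates.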

In [None]:
sns.set(style="ticks", color_codes=True)
iris = sns.load_dataset("iris")
g = sns.pairplot(iris,
                 hue="species")  # Plot pairwise relationships in a dataset.

In [None]:
g = sns.pairplot(iris, kind="reg")

In [None]:
sns.set(color_codes=True)
tips = sns.load_dataset("tips")
ax = sns.regplot(x="total_bill", y="tip",
                 data=tips)  # Plot data and a linear regression model fit.

## Part VI: SciPy

https://docs.scipy.org/doc/scipy/reference/tutorial/

https://www.guru99.com/scipy-tutorial.html

https://www.tutorialspoint.com/scipy/index.htm

https://scipy-lectures.org

https://www.edureka.co/blog/scipy-tutorial/

https://www.journaldev.com/18106/python-scipy-tutorial

https://www.dezyre.com/data-science-in-python-tutorial/scipy-introduction-tutorial

In [None]:
from scipy import linalg

# Define a square matrix
two_d_array = np.array([[4, 5], [3, 2]])

# Pass values to det() function
linalg.det(two_d_array)

In [None]:
two_d_array
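
For a 2x2 matrix the determinant is easy to check by hand, which is a useful sanity test for `linalg.det`. A small sketch (the variable names `a`, `b`, `c`, `d` and `by_hand` are only for this illustration):

```python
import numpy as np
from scipy import linalg

two_d_array = np.array([[4, 5], [3, 2]])

# For a 2x2 matrix [[a, b], [c, d]] the determinant is a*d - b*c.
(a, b), (c, d) = two_d_array
by_hand = a * d - b * c  # 4*2 - 5*3

print(by_hand)
print(np.isclose(linalg.det(two_d_array), by_hand))
```

`linalg.det` works in floating point, so comparing with `np.isclose` is safer than using `==`.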

In [None]:
from scipy import linalg
a = np.array([[1, 3, 5], [2, 5, 1], [2, 3, 8]])
a

In [None]:
b = linalg.inv(a)  # Compute the inverse of a matrix.
b

In [None]:
c = np.array([[1, 1, 1], [1, 1, 1]])
c

In [None]:
a = np.array([[1, 2], [3, 4]])
a

In [None]:
b = np.array([[5], [6]])
b

In [None]:
c = np.linalg.solve(a, b)  # Solve the equation ax = b for x.
c
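
A quick way to check the result of `solve` is to multiply the solution back by the matrix. The sketch below repeats the system from the cells above so that it runs on its own.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5], [6]])

x = np.linalg.solve(a, b)

# The solution should reproduce b when multiplied back by a.
print(np.allclose(a @ x, b))

# Forming the inverse explicitly gives the same answer for this small system,
# although solve() is generally the preferred way to do it.
print(np.allclose(np.linalg.inv(a) @ b, x))
```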

In [None]:
from scipy import stats
from scipy.stats import norm

In [None]:
a = norm.rvs(size=5)  # Random variates of a given type.
a

In [None]:
a.mean()

In [None]:
type(a)

In [None]:
a.std()

In [None]:
a.var()

# Mean is the average of the elements.

# Variance is the sum of squared differences from the mean divided by the number of elements.
# Variance = ∑(arr[i] - mean)^2 / n

# Standard Deviation is the square root of the variance
# Standard Deviation = sqrt(Variance)
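
The formulas in the comments above can be checked directly against NumPy. The array `arr` is a small made-up example chosen so that the numbers come out round.

```python
import numpy as np

arr = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = arr.sum() / arr.size
variance = ((arr - mean) ** 2).sum() / arr.size  # Divide by n (ddof=0).
std = np.sqrt(variance)

print(mean, variance, std)

# NumPy uses the same "divide by n" convention by default.
print(np.isclose(arr.var(), variance), np.isclose(arr.std(), std))
```

For this array the mean is 5, the variance is 4 and the standard deviation is 2. Passing `ddof=1` to `var` or `std` divides by `n - 1` instead, which gives the sample (unbiased) variance.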

In [None]:
import math

math.sqrt(a.var())

In [None]:
from scipy.stats import norm

In [None]:
r = norm.rvs(size=1000)

In [None]:
fig, ax = plt.subplots(1, 1)
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
plt.show()

In [None]:
r.mean()

In [None]:
r.max()

In [None]:
r.min()

In [None]:
r.argmax()

In [None]:
r.shape

In [None]:
r[r.argmax()]

In [None]:
r[r.argmin()]

In [None]:
r.sum()

In [None]:
s = 0.0
for i in range(len(r)):
    s = s + r[i]
s

In [None]:
from scipy.stats import chisquare
a = chisquare([16, 18, 16, 14, 12, 12])  # Calculate a one-way chi-square test.
a
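
To see where the chi-square statistic comes from, it can be recomputed by hand. When no expected frequencies are passed, `chisquare` assumes that all categories are equally likely. The names `observed`, `expected` and `statistic_by_hand` are only for this illustration.

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([16, 18, 16, 14, 12, 12])

# With equally likely categories, each expected count is the mean
# of the observed counts.
expected = np.full(observed.shape, observed.mean())

# Chi-square statistic: sum of (observed - expected)^2 / expected.
statistic_by_hand = ((observed - expected) ** 2 / expected).sum()

result = chisquare(observed)
print(statistic_by_hand, result.statistic, result.pvalue)
```

A large p-value means the observed counts are compatible with the "all categories equally likely" hypothesis.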

In [None]:
from scipy.optimize import minimize

In [None]:
def rosen(x):
    """The Rosenbrock function"""
    return sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)

In [None]:
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(  # Minimization of a scalar function of one or more variables.
    rosen,
    x0,
    method='nelder-mead',
    options={
        'xatol': 1e-8,
        'disp': True
    })

In [None]:
print(res.x)
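
Nelder-Mead does not use derivatives. As a sketch of a gradient-based alternative, SciPy ships the Rosenbrock function and its gradient as `scipy.optimize.rosen` and `scipy.optimize.rosen_der`, which can be passed to the BFGS method. The name `res_bfgs` is only for this illustration.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# BFGS uses the gradient (jac=rosen_der) to choose its search directions.
res_bfgs = minimize(rosen, x0, method='BFGS', jac=rosen_der)

print(res_bfgs.x)     # Should be close to the minimum at [1, 1, 1, 1, 1].
print(res_bfgs.nfev)  # Number of function evaluations used.
```

Comparing `res_bfgs.nfev` with `res.nfev` from the Nelder-Mead run shows how much work each method needed on this problem.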

In [None]:
from scipy.optimize import curve_fit


def func(x, a, b, c):
    return a * np.exp(-b * x) + c

In [None]:
xdata = np.linspace(0, 4, 50)
y = func(xdata, 2.5, 1.3, 0.5)
np.random.seed(1729)
y_noise = 0.2 * np.random.normal(size=xdata.size)
ydata = y + y_noise
plt.plot(xdata, ydata, 'b-', label='data')

In [None]:
xdata = np.linspace(0, 4, 50)
y = func(xdata, 2.5, 1.3, 0.5)
#np.random.seed( 1729 )
y_noise = 0.2 * np.random.normal(size=xdata.size)
ydata = y + y_noise
plt.plot(xdata, ydata, 'b-', label='data')

In [None]:
popt, pcov = curve_fit(
    func, xdata,
    ydata)  # Use non-linear least squares to fit a function, f, to data.
popt
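
`curve_fit` returns the covariance matrix `pcov` alongside the best-fit parameters. A common use of it, sketched below, is to take the square root of the diagonal as a one-standard-deviation error estimate for each parameter. The data is regenerated here with `np.random.default_rng` (an assumption, so the numbers differ from the cells above) to keep the cell self-contained.

```python
import numpy as np
from scipy.optimize import curve_fit


def func(x, a, b, c):
    return a * np.exp(-b * x) + c


rng = np.random.default_rng(1729)
xdata = np.linspace(0, 4, 50)
ydata = func(xdata, 2.5, 1.3, 0.5) + 0.2 * rng.normal(size=xdata.size)

popt, pcov = curve_fit(func, xdata, ydata)

# The diagonal of pcov holds the estimated variance of each parameter.
perr = np.sqrt(np.diag(pcov))

for name, value, err in zip("abc", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```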

In [None]:
plt.plot(xdata,
         func(xdata, *popt),
         'r-',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('x')
plt.ylabel('y')
plt.legend()  # For Matplotlib: place a legend on the axes.
plt.show()

In [None]:
# Constrain the optimization to the region of 0 <= a <= 3, 0 <= b <= 1 and 0 <= c <= 0.5:

In [None]:
popt, pcov = curve_fit(func, xdata, ydata, bounds=(0, [3., 1., 0.5]))
popt

In [None]:
plt.plot(xdata,
         func(xdata, *popt),
         'g--',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

In [None]:
xdata = np.linspace(0, 4, 50)
y = func(xdata, 2.5, 1.3, 0.5)
np.random.seed(1729)
y_noise = 0.2 * np.random.normal(size=xdata.size)
ydata = y + y_noise
plt.plot(xdata, ydata, 'b-', label='data')

popt, pcov = curve_fit(func, xdata, ydata)

plt.plot(xdata,
         func(xdata, *popt),
         'r-',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
#plt.show()

popt, pcov = curve_fit(func, xdata, ydata, bounds=(0, [3., 1., 0.5]))

plt.plot(xdata,
         func(xdata, *popt),
         'g--',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
#plt.show()

In [None]:
# Seed the random number generator for reproducibility
np.random.seed(0)

x_data = np.linspace(-5, 5, num=50)
y_data = 2.9 * np.sin(1.5 * x_data) + np.random.normal(size=50)

# And plot it
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data)

In [None]:
from scipy import optimize


def test_func(x, a, b):
    return a * np.sin(b * x)


params, params_covariance = optimize.curve_fit(test_func,
                                               x_data,
                                               y_data,
                                               p0=[2, 2])

print(params)

In [None]:
def test_func(x, a, b):
    return a * np.sin(b * x)


params, params_covariance = optimize.curve_fit(test_func, x_data, y_data)

print(params)

In [None]:
plt.figure(figsize=(6, 4))
plt.scatter(x_data, y_data, label='Data')
plt.plot(x_data,
         test_func(x_data, params[0], params[1]),
         label='Fitted function')

plt.legend(loc='best')

plt.show()

In [None]:
# Various options for the position of the legend.
'''
best
upper right
upper left
lower left
lower right
right
center left
center right
lower center
upper center
center
'''

In [None]:
# Seed the random number generator for reproducibility
#np.random.seed(0)

# Generate some data for this demonstration.
data = norm.rvs(10.0, 2.5, size=500)  # + np.random.normal(size=500)

# Fit a normal distribution to the data:
mu, std = norm.fit(data)

# Plot the histogram.
plt.figure(figsize=(6, 4))
plt.hist(data, bins=25, density=True, alpha=0.6, color='g')

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)  # Probability density function.
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
abs(mu - np.mean(data)) < 0.01

In [None]:
abs(std - np.std(data, ddof=1)) < 0.01

In [None]:
abs(10.0 - np.mean(data)) < 0.01

In [None]:
abs(2.5 - np.std(data, ddof=1)) < 0.01

In [None]:
abs(2.5 - np.std(data, ddof=1)) < 0.1
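
The comparisons above mix two different things: how close the fit is to the sample statistics, and how close the sample is to the true parameters. The sketch below separates them. `norm.fit` returns maximum-likelihood estimates, which for the normal distribution are the sample mean and the standard deviation with `ddof=0`. The variable name `sample` and the fixed random generator are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample = norm.rvs(10.0, 2.5, size=500, random_state=rng)

mu, std = norm.fit(sample)

# The fitted values match the sample statistics computed with ddof=0
# (dividing by n, not by n - 1).
print(np.isclose(mu, np.mean(sample)))
print(np.isclose(std, np.std(sample, ddof=0)))

# With ddof=1 the standard deviation is slightly larger than the fitted one.
print(np.std(sample, ddof=1) - std)
```

How close `mu` and `std` are to the true values 10.0 and 2.5 depends on the random sample, which is why the tolerance checks above can change between runs when the seed is not fixed.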

## Part VII: Additional materials

#### Bokeh
Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics, and affords high-performance interactivity over large or streaming datasets. Bokeh can help anyone who would like to quickly and easily make interactive plots, dashboards, and data applications.
https://docs.bokeh.org/en/latest/

#### Numba
Numba provides the ability to speed up applications with high-performance functions written directly in Python, rather than using language extensions such as Cython.

Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.

You don't need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Just apply one of the Numba decorators to your Python function, and Numba does the rest.
http://numba.pydata.org

#### Dask
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.
https://dask.org

<center>
<img src="https://meme-generator.com/wp-content/uploads/mememe/2019/11/mememe_cb8e239ef97eb73a7d04ecf46ed4bf5c-1.jpg" alt="YOU-ROCK" width="500" height="600">
</center>

In [None]:
# If the picture does not work for you (there is no Internet connection),
# you can uncomment these two lines below and then run the cell. The picture should appear.
# These two:

#from IPython.display import Image
#Image("./images/mememe_cb8e239ef97eb73a7d04ecf46ed4bf5c-1.jpg")

### Here you can find many tasks and solutions for your Python practice:

https://www.w3resource.com/python-exercises/

https://pynative.com/python-exercises-with-solutions/

https://www.machinelearningplus.com/python/101-numpy-exercises-python/

https://www.machinelearningplus.com/python/101-pandas-exercises-python/

Theory, not practice: https://www.machinelearningplus.com/plots/matplotlib-tutorial-complete-guide-python-plot-examples/

## Answers to problems:

In [None]:
# Answer to problem №0:

a = np.array([[1, 0], [1, 2]])

print("Original 2-d array:")
print(a)

print("Determinant of the said 2-D array:")
print(np.linalg.det(a))

In [None]:
# Answer to problem №1:

print("Printing original array")
sampleArray = np.array([[34, 43, 73], [82, 22, 12], [53, 94, 66]])
print(sampleArray, '\n')

print("Array after deleting column 2 on axis 1")
sampleArray = np.delete(sampleArray, 1, axis=1)
print(sampleArray, '\n')

newColumn = np.array([[10, 10, 10]])

print("Array after inserting column 2 on axis 1")
sampleArray = np.insert(sampleArray, 1, newColumn, axis=1)
print(sampleArray)

In [None]:
# Answer to problem №2:

exam_data = {
    'name': [
        'Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
        'Matthew', 'Laura', 'Kevin', 'Jonas'
    ],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify':
    ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']
}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data, index=labels)
print(df, '\n')

total_rows = len(df.axes[0])
total_cols = len(df.axes[1])

print("Number of Rows: " + str(total_rows))
print("Number of Columns: " + str(total_cols))

In [None]:
# Answer to problem №3:

exam_data = {
    'name': [
        'Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
        'Matthew', 'Laura', 'Kevin', 'Jonas'
    ],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify':
    ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']
}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data, index=labels)
print(df)

print("\nMean score across all students in the data frame:")
print(df['score'].mean())

In [None]:
# Answer to problem №4:

df = pd.read_csv("./data/company_sales_data.csv")

monthList = df['month_number'].tolist()
toothPasteSalesData = df['toothpaste'].tolist()

plt.scatter(monthList, toothPasteSalesData, label='Tooth paste Sales data')
plt.xlabel('Month Number')
plt.ylabel('Number of units Sold')
plt.legend(loc='upper left')
plt.title('Tooth paste Sales data')
plt.xticks(monthList)
plt.grid(True, linewidth=1, linestyle="--")
plt.show()

# The notebook is based on https://github.com/HSE-LAMBDA/MLatMIPS-2020 (Mosphys 2020, https://mosphys.ru) and many other links from the Internet.