# HSE 2023: Mathematical Methods for Data Analysis

## Homework 1

### Attention!
* For tasks where a <ins>text answer</ins> is required, **Russian language** is **allowed**.
* If a task asks you to describe something (make conclusions), then a **text answer** is **mandatory** and **is** part of the task.
* **Do not** upload the dataset (titanic.csv) to the grading system (we already have it).
* We **only** accept **ipynb** notebooks. If you use Google Colab, you will have to download the notebook before submitting the homework.
* **Do not** use Python loops instead of NumPy vector operations over NumPy vectors: it significantly decreases performance (see why: https://blog.paperspace.com/numpy-optimization-vectorization-and-broadcasting/) and will be penalized with -0.25 for **every** task.
Loops are only allowed in part 1 (Tasks 1 - 4).
* Some tasks contain tests. They only check your solution on a simple example, so passing the test does **not** guarantee the full grade for the task.

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Python (2 points)

**Task 1** (0.5 points)<br/>
Enter nonzero numbers `a`, `d` and `N`. Find the sum of the first `N` members of the [harmonic sequence](https://en.wikipedia.org/wiki/Harmonic_progression_(mathematics)) whose first member has denominator `a` and whose denominators increase by `d`.

In [3]:
def find_sum(a: int, d: int, N: int) -> float:
    result = 0
    for i in range(N):
        result += 1 / (a + i * d)
    return result

a = 1
d = 1
N = 10
print(find_sum(a, d, N))

2.9289682539682538
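
As a sanity check on the floating-point result above, the same sum can be computed exactly with rational arithmetic. This is only a sketch: the helper name `find_sum_exact` is an assumption made for illustration and is not part of the assignment.

```python
from fractions import Fraction

def find_sum_exact(a: int, d: int, N: int) -> Fraction:
    # Sum 1/a + 1/(a + d) + ... + 1/(a + (N - 1) * d) without rounding error.
    # Assumes that no denominator a + i * d is zero.
    total = Fraction(0)
    for i in range(N):
        total += Fraction(1, a + i * d)
    return total

exact = find_sum_exact(1, 1, 10)
print(exact)         # 7381/2520
print(float(exact))  # agrees with the floating-point sum up to rounding
```

For `a = 1`, `d = 1` the sequence is the plain harmonic series, so the result is the 10th harmonic number.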


**Task 2** (0.5 points) <br/>
Enter an integer number `N`. Check if it is a palindrome number **without converting it to a string**. A palindrome number reads the same from left to right and from right to left.

In [4]:
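def check_palindrome(N):
    # Build the digit-reversed number arithmetically (no string conversion).
    copy_of_N = N
    reversed_N = 0
    while copy_of_N != 0:
        reversed_N = reversed_N * 10 + copy_of_N % 10
        copy_of_N //= 10

    # N is a palindrome if it equals its own digit reversal.
    return reversed_N == N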

for N in [3, 81, 111, 113, 810, 2022, 4774, 51315, 611816]:
    print(N, check_palindrome(N))

3 True
81 False
111 True
113 False
810 False
2022 False
4774 True
51315 True
611816 False
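
One caveat of the digit-reversal loop: for a negative `N` it never terminates, because Python's floor division rounds towards negative infinity (`-1 // 10 == -1`), so the counter never reaches zero. A possible guard is sketched below; the name `check_palindrome_safe` and the decision to treat negative numbers as non-palindromes are assumptions, since the task does not say how they should be handled.

```python
def check_palindrome_safe(N: int) -> bool:
    # Hypothetical variant: negative numbers are treated as non-palindromes.
    if N < 0:
        return False
    remaining, reversed_N = N, 0
    while remaining != 0:
        # Move the last digit of `remaining` to the end of `reversed_N`.
        reversed_N = reversed_N * 10 + remaining % 10
        remaining //= 10
    return reversed_N == N

print(check_palindrome_safe(4774))  # True
print(check_palindrome_safe(-121))  # False
```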


**Task 3** (0.5 points) <br/>
Find the first `N` palindrome numbers starting from 42 (you may use the function from the previous task).

In [5]:
def find_palindromes(N):
    i = 42
    counter = 0
    palindromes = []
    
    while counter != N:
        if check_palindrome(i):
            palindromes.append(i)
            counter += 1
        i += 1
            
    return palindromes

print(find_palindromes(3))
print(find_palindromes(13))
print(find_palindromes(23))

[44, 55, 66]
[44, 55, 66, 77, 88, 99, 101, 111, 121, 131, 141, 151, 161]
[44, 55, 66, 77, 88, 99, 101, 111, 121, 131, 141, 151, 161, 171, 181, 191, 202, 212, 222, 232, 242, 252, 262]


**Task 4** (0.5 points) <br/>
There are three numbers: `a`, `b`, `c`. Find their median without using `min`, `max`, or any other functions.

In [6]:
from random import randint

def find_median(a, b, c):
    result = 0
    if a >= b:
        if a < c:
            result = a
        else:
            result = b if b > c else c
    else:
        if a > c:
            result = a
        else:
            result = b if b < c else c
        
    return result

for i in range(10):

    a = randint(-100, 100)
    b = randint(-100, 100)
    c = randint(-100, 100)

    print(a, b, c, '\tMedian:', find_median(a, b, c))

28 39 3 	Median: 28
22 95 94 	Median: 94
24 -32 -91 	Median: -32
-62 -100 82 	Median: -62
22 21 -73 	Median: 21
-6 -84 -35 	Median: -35
-84 0 -26 	Median: -26
-94 -51 -58 	Median: -58
8 -61 75 	Median: 8
98 62 -75 	Median: 62
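
Random triples rarely contain ties, so the equal-value branches of the comparison tree are hardly exercised by the run above. The sketch below checks every triple from a small range against sorting; `sorted` is used only for verification, not in the solution itself, and the function is repeated so that the block runs on its own.

```python
from itertools import product

def find_median(a, b, c):
    # Same comparison tree as in the solution above.
    if a >= b:
        if a < c:
            return a
        return b if b > c else c
    if a > c:
        return a
    return b if b < c else c

# Compare against sorting for every triple from a small range, ties included.
values = range(-3, 4)
mismatches = [
    (a, b, c)
    for a, b, c in product(values, repeat=3)
    if find_median(a, b, c) != sorted((a, b, c))[1]
]
print(len(mismatches))  # 0
```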


# 2. Numpy (4 points)

**Task 1** (0.5 points) <br/>
Create a random array drawn from a Gaussian distribution with length 12 and with the sum of its elements equal to 15. The distribution must remain Gaussian.

In [31]:
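target_sum = 15  # named to avoid shadowing the built-in `sum`
my_array = np.random.normal(0, 1, 12)
# Rescale so that the elements add up to target_sum.
nf = target_sum / np.sum(my_array)
my_array *= nf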
print(f'Length: {len(my_array)}')
print(f'Sum of elements: {np.sum(my_array)}')

Length: 12
Sum of elements: 15.000000000000002
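
Rescaling multiplies every element by a random factor, which becomes very large whenever the sampled sum happens to be close to zero. One possible alternative, given here as a sketch rather than as the required solution, is to add the same constant to every element. Each shifted element is then a linear combination of Gaussian variables and is therefore still Gaussian. The seed and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only for reproducibility
target_sum = 15
n = 12

sample = rng.normal(0, 1, n)
# Add the same constant to every element so that the total becomes target_sum.
shifted = sample + (target_sum - sample.sum()) / n

print(len(shifted))                           # 12
print(np.isclose(shifted.sum(), target_sum))  # True
```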


**Task 2** (0.5 points) <br/>
Create two random arrays $a$ and $b$ with the same length.

Calculate the following distances between the arrays **without using any special function. You may only use basic numpy operations (`np.linalg.*` and other high-level ones are prohibited).**:

* Manhattan Distance
$$ d(a, b) = \sum_i |a_i - b_i| $$
* Euclidean Distance
$$ d(a, b) = \sqrt{\sum_i (a_i - b_i)^2} $$
* Chebyshev Distance
$$ d(a, b) = \underset{i}{\max} |a_i - b_i| $$
* Cosine Distance
$$ d(a, b) = 1 - \frac{a^\top b}{||a||_2\cdot||b||_2} $$


In [132]:
def calculate_manhattan(a, b):
    distance = np.sum(np.abs(a - b))
    return distance

def calculate_euclidean(a, b):
    distance = np.sqrt(np.sum(np.square(a - b)))
    return distance

def calculate_chebyshev(a, b):
    distance = np.max(np.abs(a - b))
    return distance

def calculate_cosine(a, b):
    cosine_similarity = np.sum(a * b) / (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2)))
    distance = 1 - cosine_similarity
    return distance

In [131]:
a = np.random.rand(10)
b = np.random.rand(10)
print(f'Manhattan distance: {calculate_manhattan(a, b)}')
print(f'Euclidean distance: {calculate_euclidean(a, b)}')
print(f'Chebyshev distance: {calculate_chebyshev(a, b)}')
print(f'Cosine distance: {calculate_cosine(a, b)}')

Manhattan distance: 2.907643172875607
Euclidean distance: 1.0698559489374087
Chebyshev distance: 0.5630794980528417
Cosine distance: 0.15370675234608633
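
Because the arrays above are random, the printed values cannot be verified by eye. A small hand-computable pair makes the formulas easy to check; the vectors below are chosen arbitrarily for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
diff = np.abs(a - b)  # element-wise |a_i - b_i| = [3, 4, 0]

manhattan = np.sum(diff)              # 3 + 4 + 0 = 7
euclidean = np.sqrt(np.sum(diff**2))  # sqrt(9 + 16 + 0) = 5
chebyshev = np.max(diff)              # max(3, 4, 0) = 4
# a.b = 25, ||a||^2 = 14, ||b||^2 = 61
cosine = 1 - np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

print(manhattan, euclidean, chebyshev)  # 7.0 5.0 4.0
print(cosine)
```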


**Task 3** (0.5 points) <br/>
Create a random array of integers with length of 42. Transform the array so that
* Maximum element(s) value is 10
* Minimum element(s) value is -6
* Other values are in interval (-6; 10) without changing the relative order (relation $\frac{x_i}{x_{i-1}}=\frac{\widehat{x}_{i}}{\widehat{x}_{i-1}}$ holds)

In [158]:
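def transform(array):
    # Target range; named to avoid shadowing the built-ins `min` and `max`.
    new_max = 10
    new_min = -6
    # Min-max scale to [0, 1], then stretch and shift to [new_min, new_max].
    tmp = (array - np.min(array)) / (np.max(array) - np.min(array))
    transformed_array = tmp * (new_max - new_min) + new_min
    return transformed_array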

In [159]:
my_array = np.random.randint(0, 100, 42)
print(my_array)
my_array = transform(my_array)
print(f'Min: {np.min(my_array)}')
print(f'Max: {np.max(my_array)}')
print('Array:')
print(my_array)

[77 39 36 44 56 24 87 96 30 81  6 22  4 74 18 39 87 48 89 68 14 81  7 95
 48 99 94  9 26 52 93 47 64 66 77 64 33  7 45 94 66 43]
Min: -6.0
Max: 10.0
Array:
[ 6.29473684 -0.10526316 -0.61052632  0.73684211  2.75789474 -2.63157895
  7.97894737  9.49473684 -1.62105263  6.96842105 -5.66315789 -2.96842105
 -6.          5.78947368 -3.64210526 -0.10526316  7.97894737  1.41052632
  8.31578947  4.77894737 -4.31578947  6.96842105 -5.49473684  9.32631579
  1.41052632 10.          9.15789474 -5.15789474 -2.29473684  2.08421053
  8.98947368  1.24210526  4.10526316  4.44210526  6.29473684  4.10526316
 -1.11578947 -5.49473684  0.90526316  9.15789474  4.44210526  0.56842105]
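
The transformation used above is min-max scaling: the array is first mapped to [0, 1] and then stretched to the target interval. A three-element example shows the arithmetic; the helper name `rescale` and its default arguments are assumptions made for illustration.

```python
import numpy as np

def rescale(array, new_min=-6, new_max=10):
    # Min-max scaling; assumes the array is not constant (max > min),
    # otherwise the denominator would be zero.
    unit = (array - array.min()) / (array.max() - array.min())
    return unit * (new_max - new_min) + new_min

# 0 -> -6, 100 -> 10, and the midpoint 50 -> 0.5 * 16 - 6 = 2
print(rescale(np.array([0, 50, 100])).tolist())  # [-6.0, 2.0, 10.0]
```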


**Task 4** (0.5 points) <br/>
Create an array with shape of $10 \times 3$ with integers from [-12, 4]. Find a column that contains the minimum element of the array.

In [204]:
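shape = (10, 3)
low = -12  # named to avoid shadowing the built-ins `min` and `max`
high = 4
# randint excludes the upper bound, hence high + 1.
my_array = np.random.randint(low, high + 1, shape)
# Column index of the first occurrence (in row-major order) of the minimum.
selected_column = np.where(my_array == np.min(my_array))[1][0]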
print('Shape: ', my_array.shape)
print('Array:')
print(my_array)
print(f'Selected column: {selected_column}')

Shape:  (10, 3)
Array:
[[  4  -1  -5]
 [ -3  -3  -2]
 [  2  -9  -2]
 [ -1  -1  -7]
 [ -5  -8  -9]
 [ -3   2 -11]
 [ -1  -5  -8]
 [ -7  -6  -2]
 [  4 -11 -10]
 [ -5  -4  -3]]
Selected column: 2
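
In the array printed above the minimum (-11) occurs twice, in columns 2 and 1, and the solution reports only the first occurrence in row-major order. The sketch below shows an equivalent way to get that first column with `np.argmin` and `np.unravel_index`, and how to list every column containing the minimum.

```python
import numpy as np

# The array printed above, copied here so the example is reproducible.
arr = np.array([
    [4, -1, -5], [-3, -3, -2], [2, -9, -2], [-1, -1, -7], [-5, -8, -9],
    [-3, 2, -11], [-1, -5, -8], [-7, -6, -2], [4, -11, -10], [-5, -4, -3],
])

# argmin works on the flattened array; unravel_index maps it back to (row, column).
row, col = np.unravel_index(np.argmin(arr), arr.shape)
print(int(row), int(col))  # 5 2

# All columns that contain the minimum value.
cols_with_min = np.unique(np.where(arr == arr.min())[1])
print(cols_with_min.tolist())  # [1, 2]
```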


**Task 5** (0.5 points) <br/>

Replace all missing values in the following array with the most appropriate quantile and explain why you chose it.

In [333]:
arr = np.random.rand(10)
idx = np.random.randint(0, 10, 4)
arr[idx] = np.nan

print('Array:')
print(arr)

Array:
[0.9316538  0.99626765 0.91193299 0.7174666  0.50246522 0.26793081
        nan        nan        nan 0.59859942]


In [328]:
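def replace_missing(arr):
    # Median (0.5 quantile) of the non-missing values; nanmedian ignores NaN.
    median = np.nanmedian(arr)
    # Note: this fills the NaN entries of the input array in place.
    arr[np.isnan(arr)] = median
    array_without_missing = arr
    return array_without_missing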

In [329]:
arr = replace_missing(arr)
print('Array with no missing values:')
print(arr)

Array with no missing values:
[0.39916616 0.39920935 0.39916616 0.39916616 0.39916616 0.23076419
 0.98432515 0.39912297 0.37619695 0.93802117]


**Answer:** I chose the median. Up to 40% of the values are missing, so whatever we substitute will strongly affect the sample. With that many NaNs I would normally consider dropping the feature altogether, but since replacement is required, the choice is between the median and the mean. With such a small sample (10 values), the mean can be pulled away by a single extreme value, so filling with it would insert values unlike anything in the original data, and a model would then be trained on them. The median is robust to this, which is why I settled on it.
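
The solution above fills the missing entries of the input array in place. If the original array must stay untouched, `np.where` can build a new array instead. This is a sketch on the array printed in the task statement; the name `replace_missing_copy` is an assumption.

```python
import numpy as np

def replace_missing_copy(arr):
    # Hypothetical non-mutating variant: np.where builds a new array.
    return np.where(np.isnan(arr), np.nanmedian(arr), arr)

arr = np.array([0.9316538, 0.99626765, 0.91193299, 0.7174666, 0.50246522,
                0.26793081, np.nan, np.nan, np.nan, 0.59859942])
filled = replace_missing_copy(arr)

print(np.nanmedian(arr))            # 0.7174666, the middle of the 7 known values
print(int(np.isnan(arr).sum()))     # 3, the input is left untouched
print(int(np.isnan(filled).sum()))  # 0
```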

**Task 6** (0.5 points) <br/>
Create a function which takes an image ```X``` (3D array of the shape (n, m, 3)) as an input and returns the median and std for every channel (you should get a vector of shape 3, RGB).

In [292]:
def stats_channel(X):
    median = np.median(X, axis=(0, 1))
    std = np.std(X, axis=(0, 1))
    return median, std

In [302]:
n = 19
m = 23
X = np.random.randint(-11, 8, size=(n, m, 3))
print(stats_channel(X))

(array([-2., -2., -2.]), array([5.55453423, 5.56107201, 5.50556009]))
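
Passing `axis=(0, 1)` reduces over height and width at once and leaves one value per channel. A tiny hypothetical image, small enough to check by hand, illustrates this.

```python
import numpy as np

# A hypothetical 2 x 2 image; the last axis holds the R, G, B channels.
X = np.array([
    [[1, 10, 0], [2, 10, 0]],
    [[3, 10, 0], [4, 10, 8]],
])

median = np.median(X, axis=(0, 1))  # reduce over height and width
std = np.std(X, axis=(0, 1))

# R = {1, 2, 3, 4}, G = {10, 10, 10, 10}, B = {0, 0, 0, 8}
print(median.tolist())  # [2.5, 10.0, 0.0]
print(std.shape)        # (3,)
```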


**Task 7** (1 point) <br/>
Create a function which takes a 3D matrix ```X``` as an input and returns all its unique values along the first axis.

Sample input:

```
np.array(
  [[[1, 2, 3],
    [1, 2, 3],
    [1, 2, 3]],

   [[4, 5, 6],
    [4, 5, 7],
    [4, 5, 6]],

   [[7, 8, 9],
    [7, 8, 9],
    [7, 8, 9]]]
)
```
  
Sample output:

```
np.array(
  [[[1, 2, 3],
    [1, 2, 3]],

   [[4, 5, 6],
    [4, 5, 7]],

   [[7, 8, 9],
    [7, 8, 9]]]
)
```
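
The sample pair above is consistent with `np.unique(X, axis=1)`: the slices `X[:, j, :]` are compared as whole units and duplicates are dropped. The check below is a sketch that reproduces the sample input and compares the result with the sample output.

```python
import numpy as np

X = np.array(
  [[[1, 2, 3],
    [1, 2, 3],
    [1, 2, 3]],

   [[4, 5, 6],
    [4, 5, 7],
    [4, 5, 6]],

   [[7, 8, 9],
    [7, 8, 9],
    [7, 8, 9]]]
)

# Slices j = 0 and j = 2 are identical, so only two unique slices remain.
result = np.unique(X, axis=1)
print(result.shape)  # (3, 2, 3)
print(result)
```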

In [297]:
def get_unique_values(X):
    result = np.unique(X, axis=1)
    return result

In [331]:
X =  np.random.randint(4, 6, size=(n, 3, 3))
print('Matrix:')
print(X)
print('Unique :')
get_unique_values(X)

Matrix:
[[[4 4 5]
  [5 5 4]
  [4 5 4]]

 [[5 4 4]
  [5 5 5]
  [5 5 4]]

 [[5 5 5]
  [5 5 5]
  [4 5 4]]

 [[4 4 4]
  [5 5 4]
  [4 4 5]]

 [[5 4 4]
  [4 4 4]
  [4 5 4]]

 [[5 5 4]
  [5 5 4]
  [5 4 4]]

 [[4 5 5]
  [4 4 5]
  [4 4 4]]

 [[4 4 5]
  [5 4 4]
  [4 5 5]]

 [[5 5 5]
  [5 4 4]
  [5 5 4]]

 [[4 4 5]
  [5 5 4]
  [4 5 4]]

 [[5 4 5]
  [4 5 5]
  [5 4 4]]

 [[4 4 5]
  [5 4 5]
  [4 5 5]]

 [[4 4 5]
  [4 4 5]
  [4 5 5]]

 [[5 5 5]
  [5 4 4]
  [5 4 4]]

 [[4 5 4]
  [5 5 4]
  [4 4 4]]

 [[4 4 5]
  [4 5 4]
  [5 4 5]]

 [[5 4 4]
  [4 5 4]
  [4 4 5]]

 [[5 5 4]
  [5 4 4]
  [5 5 4]]

 [[5 5 4]
  [5 4 4]
  [5 4 5]]]
Unique :


array([[[4, 4, 5],
        [4, 5, 4],
        [5, 5, 4]],

       [[5, 4, 4],
        [5, 5, 4],
        [5, 5, 5]],

       [[5, 5, 5],
        [4, 5, 4],
        [5, 5, 5]],

       [[4, 4, 4],
        [4, 4, 5],
        [5, 5, 4]],

       [[5, 4, 4],
        [4, 5, 4],
        [4, 4, 4]],

       [[5, 5, 4],
        [5, 4, 4],
        [5, 5, 4]],

       [[4, 5, 5],
        [4, 4, 4],
        [4, 4, 5]],

       [[4, 4, 5],
        [4, 5, 5],
        [5, 4, 4]],

       [[5, 5, 5],
        [5, 5, 4],
        [5, 4, 4]],

       [[4, 4, 5],
        [4, 5, 4],
        [5, 5, 4]],

       [[5, 4, 5],
        [5, 4, 4],
        [4, 5, 5]],

       [[4, 4, 5],
        [4, 5, 5],
        [5, 4, 5]],

       [[4, 4, 5],
        [4, 5, 5],
        [4, 4, 5]],

       [[5, 5, 5],
        [5, 4, 4],
        [5, 4, 4]],

       [[4, 5, 4],
        [4, 4, 4],
        [5, 5, 4]],

       [[4, 4, 5],
        [5, 4, 5],
        [4, 5, 4]],

       [[5, 4, 4],
        [4, 4, 5],
        [4, 5, 4]]

# 3. Pandas & Visualization (4 points)


You are going to work with the *Titanic* dataset, which contains information about the passengers of the Titanic:
- **Survived** - 1 = survived, 0 = died; **Target variable**
- **pclass** - passenger's class;
- **sex** - passenger's sex
- **Age** - passenger's age in years
- **sibsp**    - number of the passenger's siblings/spouses aboard
- **parch**    - number of the passenger's parents/children aboard
- **ticket** - ticket number    
- **fare** - ticket price    
- **cabin** - cabin number
- **embarked** - port of Embarkation; C = Cherbourg, Q = Queenstown, S = Southampton

**Note** for all visualizations use matplotlib or seaborn but NOT plotly! Plotly's graphics sometimes vanish after saving. In this case the task won't be graded.

**Note** support all your answers with the necessary code, computations, visualization, and explanation. Answers without code and explanation won't be graded.

**Task 0** (0 points) \
Load the dataset and print the first 6 rows

In [432]:
dataset =  pd.read_csv("titanic.csv", index_col=0)
dataset.head(6)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |


**Task 1** (1 point) <br/>
Answer the following questions:
    
    * Are there any missing values? In what columns?
    * What is the percentage of survived passengers? Are the classes balanced?
    * Were there more males or females?
    * What was the least popular port of embarkation?
    * How many passenger classes (pclass) were there on the Titanic?
    * What is the overall average ticket fare? And for every passenger class?
Please write not only the answers but also the code that proves them.

In [389]:
print("\t1) As we see, there are Nan-values in columns \"Age\", \"Cabin\", \"Embarked\"")
columns_with_Nan = dataset.isna().sum()
print(columns_with_Nan[columns_with_Nan != 0])

print("\t2) Percentage of survived passengers is 38. Classes are not balanced, because the number of dead exceeds the number of survivors")
survived = dataset.Survived
print(f'{round(survived[survived == 1].shape[0] / dataset.shape[0] * 100)}%')

print("\t3) There are more males than females")
print(f'Male = {dataset[dataset.Sex == "male"].shape[0]} > Female = {dataset[dataset.Sex == "female"].shape[0]}')

print("\t4) Least popular port of embarkation was Queenstown")
print(f'{dataset.Embarked.value_counts().idxmin()}')

print(f"\t5) There were {dataset.Pclass.nunique()} classes")

print(f"\t6) The overall average ticket fare was: {dataset.Fare.mean():.2f}")
print(dataset.groupby('Pclass').Fare.mean())

	1) As we see, there are Nan-values in columns "Age", "Cabin", "Embarked"
Age         177
Cabin       687
Embarked      2
dtype: int64
	2) Percentage of survived passengers is 38. Classes are not balanced, because the number of dead exceeds the number of survivors
38%
	3) There are more males than females
Male = 577 > Female = 314
	4) Least popular port of embarkation was Queenstown
Q
	5) There were 3 classes
	6) The overall average ticket fare was: 32.20
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
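
For the class-balance question, `value_counts(normalize=True)` reports the share of each target class directly, and the mean of a 0/1 column equals the fraction of ones. The sketch below applies both to the six rows printed by `head(6)` earlier, reduced to three columns. Six rows are not representative of the full dataset, so the numbers only illustrate the calls.

```python
import pandas as pd

# The six rows shown by head(6), reduced to three columns.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# Share of each target class; values near 0.5 would indicate balance.
print(sample.Survived.value_counts(normalize=True))

# The mean of a 0/1 column is the fraction of ones.
print(sample.Survived.mean())  # 0.5

# Average fare per passenger class.
fare_by_class = sample.groupby("Pclass").Fare.mean()
print(fare_by_class)
```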


**Task 2** (0.5 points) <br/>
Visualize the age distribution (you may use a histogram, for example).

    * What are the minimum and maximum ages of the passengers? Visualize them on the plot
    * What is the median age? And among males and females separately? Visualize it on a separate plot
    * Make conclusions about what you see on the plots

In [None]:
## Your code here

**Task 3** (1 point) <br/>
Find all the titles of the passengers (for example, *Capt., Mr., Mme.*), which are written in the column Name, and answer the following questions:

    * How many unique titles are there?
    * How many passengers are there with each title?
    * What is the most popular man's title? And woman's title?
    
**Hint** You may select the title from the name as the word which ends with a dot and is not a middle name.
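
The hint can be expressed as a regular expression: a space, a run of letters, then a literal dot, taking the first such match in the name. The sketch below uses the standard `re` module on names that appear in the first rows of the dataset; this is one possible pattern, not the only valid one.

```python
import re

# Names taken from the first rows of the dataset shown earlier.
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
]

# A space, then letters, then a literal dot: the first such word is the title.
pattern = re.compile(r" ([A-Za-z]+)\.")
titles = [pattern.search(name).group(1) for name in names]
print(titles)  # ['Mr', 'Mrs', 'Miss']
```

The raw-string prefix `r` keeps `\.` from being interpreted as a string escape sequence.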

In [440]:
dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(f"1) Number of unique titles is {dataset.Title.nunique()}")

print(f"2) Number of passengers with each title:\n{dataset.Title.value_counts()}")

# Filter the Title column by sex, then take the most frequent value.
print(f"Men's most popular title: {dataset.Title[dataset.Sex == 'male'].value_counts().idxmax()}")
print(f"Women's most popular title: {dataset.Title[dataset.Sex == 'female'].value_counts().idxmax()}")

1) Number of unique titles is 17
2) Number of passengers with each title:
Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64
Men's most popular title: Mr
Women's most popular title: Miss


**Task 4** (0.5 points) <br/>
Is there a correlation between *pclass* and *ticket price*? Calculate the mean price for each class and visualize the price distribution for each class as well. Make conclusions about what you see on the plot.

Hint: you could make one or several plot types, e.g. box, violin, pair, histogram (see additional notebooks for Seminar 1 "Visualization with Seaborn"). The main point here is to **choose** plots wisely and **make meaningful conclusions**



In [None]:
## Your code here

**Task 5** (0.5 points) <br/>
The same question as in task 4 about correlation between *embarked* and *ticket price*.

In [None]:
## Your code here

**Task 6** (0.5 points) <br/>
Visualize age distribution for survived and not survived passengers separately and calculate the mean age for each class. Are they different? Provide the same visualization for males and females separately. Make conclusions about what you see on the plots

In [None]:
## Your code here