## Helpful links

**Getting started**

- Anaconda (a Python distribution that includes Jupyter Notebook): https://www.anaconda.com/download/ (Python 3.6 or higher is recommended; remember to select the installer for your operating system)
- Jupyter documentation: https://jupyter.readthedocs.io/en/latest/
- pip documentation (for installing Python packages with `pip install`): https://pip.readthedocs.io/en/latest/
- PyCharm: https://www.jetbrains.com/pycharm/

**Common**

- for any questions: https://www.google.com/
- for (almost) any answers: https://stackoverflow.com/
- [a professional information and analytical resource dedicated to machine learning, pattern recognition, and data mining](http://www.machinelearning.ru/wiki/index.php?title=%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0)
- A visual introduction to machine learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

**Python & Jupyter**

- A Crash Course in Python for Scientists: http://nbviewer.jupyter.org/gist/rpmuller/5920182
- A Gallery of interesting Jupyter Notebooks: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
- Markdown Cheatsheet: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet

**pandas**

- documentation: http://pandas.pydata.org/pandas-docs/stable/
- 10 minutes to pandas: https://pandas.pydata.org/pandas-docs/stable/10min.html
- Pandas Tutorial: DataFrames in Python: https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
- Cheat Sheet: https://www.analyticsvidhya.com/blog/2015/07/11-steps-perform-data-analysis-pandas-python/
- Visualization: http://pandas.pydata.org/pandas-docs/stable/visualization.html

**sklearn**

- documentation and more: http://scikit-learn.org/stable/

**Other libraries**

- matplotlib: https://matplotlib.org/users/pyplot_tutorial.html
- seaborn: http://seaborn.pydata.org/

## Lab 1: Working with Pandas

Pandas is a Python library that provides extensive data analysis capabilities. It makes it convenient to load, process, and analyze tabular data using SQL-like queries.

In [39]:
import pandas as pd

The main data structures in Pandas are the Series and DataFrame classes. The first one is a one-dimensional indexed data array of some fixed type. The second is a two-dimensional data structure, which is a table with each column containing data of the same type. You can represent it as a dictionary of objects of the Series type.
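As a small illustration of the "dictionary of Series" view, the sketch below builds a DataFrame from two Series. The values and the names `age`, `income`, and `clients` are made up for this example and are not taken from `data.csv`.

```python
import pandas as pd

# Two Series that share the same index labels (toy values, not from data.csv)
age = pd.Series([25, 40, 31], index=["a", "b", "c"])
income = pd.Series([3000.0, 5200.0, None], index=["a", "b", "c"])

# A DataFrame can be built from a dictionary of Series: keys become column names
clients = pd.DataFrame({"age": age, "income": income})

print(clients.dtypes)        # each column has its own type
print(type(clients["age"]))  # selecting a column gives back a Series
```

The missing income is stored as `NaN`, which is why that column is floating-point while `age` stays integer.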

We will use the Pandas library to analyze data. We will work with a bank's credit data: when issuing a loan, the bank wants to know whether the borrower will fall 90 days or more behind on payments.

### 1
Read the data from the `data.csv` file.

*Functions that can be useful: `pd.read_csv(..., delimiter=',')`*
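For reference, here is a minimal sketch of how `pd.read_csv` behaves, using an in-memory string instead of a real file so that it runs anywhere. The column names and values are invented for illustration; the actual layout of `data.csv` may differ.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the column names here are invented
csv_text = "id,age,income\n1,25,3000\n2,40,\n3,31,4100\n"

# read_csv accepts a path or any file-like object; delimiter is an alias of sep
toy = pd.read_csv(io.StringIO(csv_text), delimiter=",")

print(toy.shape)    # (number of rows, number of columns)
print(toy.head(2))  # first two rows
```

Empty fields, such as the missing income in the second row, are read as `NaN`.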

In [1]:
# place for code


### 2
Output a summary description of the loaded data.

*Functions that can be useful: `.describe()`*

In [2]:
# place for code


### 3
Display the first and last few entries.

*Functions that can be useful: `.head()`, `.tail()`*

*What parameters can be passed to these functions?*

In [None]:
# place for code


### 4
Open the file `DataDictionary-ru.txt` in a text editor and read what the columns of the data matrix mean. Write below which type each column belongs to (real, integer, or categorical).

In [None]:
# place for code


### 5

Note that the `DebtRatio` column contains implausible values. Only the values in rows with a known monthly income are actual ratios. The rest are absolute values of monthly credit payments.

Correct the data by making all values of the `DebtRatio` column absolute (multiply the ratios by `MonthlyIncome`). To make your program run quickly on the full data, try not to use a loop.

#### *Functions that can be useful:*

Accessing DataFrame elements:
  * element: `data.loc[i, 'columnName']`
  * column: `data['columnName']`
  * sub-matrix: `data.loc[a:b, columnNameList]`

Conditional indexing:
* `data.loc[data['columnName'] > 20, columnNameList]`

It is better to write it like this:

* `i = data['columnName'] > 20`  # i - boolean array of `True` and `False`
* `data.loc[i, columnNameList]`

The rows of a sub-matrix keep the row labels (index values) of the original DataFrame.

* `pandas.isnull(scalar or array)` - checking whether the value is undefined (`NaN`)
* `pandas.notnull(scalar or array)` - checking whether the value is defined (not `NaN`)
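The indexing tools listed above can be combined into a loop-free update. The sketch below uses a toy table with hypothetical columns `score` and `weight` (not the lab's columns) to show a boolean mask, a vectorized assignment through `.loc`, and the preserved row labels.

```python
import numpy as np
import pandas as pd

# Toy table; the column names are hypothetical and unrelated to data.csv
df = pd.DataFrame(
    {"score": [10.0, 25.0, np.nan, 40.0], "weight": [1.0, 2.0, 3.0, 4.0]},
    index=[100, 101, 102, 103],
)

# Boolean mask: True where the score is defined and greater than 20
mask = pd.notnull(df["score"]) & (df["score"] > 20)

# Vectorized update of the selected rows only, no explicit loop
df.loc[mask, "score"] = df.loc[mask, "score"] * df.loc[mask, "weight"]

# The selected sub-table keeps the row labels of the original
print(df.loc[mask, ["score", "weight"]])
```

Storing the mask in a variable first avoids recomputing the condition and keeps the assignment line readable.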

In [22]:
# place for code


### 6

Rename the `DebtRatio` column to `Debt`.

*Functions that can be useful: `.rename(columns={'oldName':'newName'}, inplace=True)`*

In [24]:
# place for code


### 7

Calculate the average monthly income and assign the resulting number to all clients with unknown (or zero) income.

*Functions that can be useful: `.mean()`*

*Other descriptive statistics:* https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats

In [None]:
# place for code


### 8

Using the `groupby` method, estimate the probability of non-repayment of the loan (`SeriousDlqin2yrs=1`) for different values of the number of dependents (`NumberOfDependents`).  

Do the same for different values of the `NumberRealEstateLoansOrLines` column.

*Help:*
`data['column1'].groupby(data['column2']).mean()`  *-- calculating the average values of column1 for groups of column2*
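To see why a group mean can serve as a probability estimate, consider this sketch on toy data. The column names `flag` and `group` are hypothetical; the point is only that the mean of a 0/1 column within a group equals the share of ones in that group.

```python
import pandas as pd

# Toy data: a 0/1 flag and a grouping column (names are hypothetical)
toy = pd.DataFrame(
    {"flag": [0, 1, 0, 1, 1, 0], "group": [0, 0, 1, 1, 1, 2]}
)

# The mean of a 0/1 column within each group is the share of ones in it
share = toy["flag"].groupby(toy["group"]).mean()
print(share)

# Equivalent spelling that groups the whole DataFrame first
share_alt = toy.groupby("group")["flag"].mean()
```

Groups with very few rows give noisy estimates, so it can help to look at the group sizes as well (for example with `.size()`).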

In [None]:
# place for code


## Data visualization

In [3]:
import matplotlib.pyplot as plt

# enable graphs output directly to this notebook
%matplotlib inline

Matplotlib makes it easy to visualize tabular data.

*Functions that can be useful:*

* Drawing:
   * `plt.plot(x, y)` - line plot, see http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot
   * `plt.show()`
   * `plt.scatter(x, y)` - scatter plot, see http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter
   * `plt.hist()` - histogram, see http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist

* Drawing several plots on the same axes:
    * `fig, ax = plt.subplots()`
    * `ax.hist(...)`
    * `ax.hist(...)`
    * `plt.show()`
   
* Logarithmic scale:
    * `ax.set_xscale('log')` or `ax.set_yscale('log')`
* Restricting the visible plot area:
    * `ax.axis([x1, x2, y1, y2])`
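The calls listed above fit together roughly as in the following sketch, which overlays two histograms on one set of axes, switches the y-axis to a logarithmic scale, and restricts the visible area. The data are synthetic random samples generated only for this example, and the bin edges and axis limits are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic samples for illustration only (not the lab data)
rng = np.random.default_rng(0)
a = rng.lognormal(mean=8.0, sigma=0.5, size=1000)
b = rng.lognormal(mean=8.5, sigma=0.5, size=1000)

fig, ax = plt.subplots()
bins = np.linspace(0, 25000, 51)  # 50 equal-width bins shared by both histograms

# density=True normalizes each histogram so that its total area is 1
ax.hist(a, bins=bins, density=True, alpha=0.5, color="blue", label="sample a")
ax.hist(b, bins=bins, density=True, alpha=0.5, color="red", label="sample b")

ax.set_yscale("log")             # logarithmic y-axis
ax.axis([0, 25000, 1e-7, 1e-3])  # [x1, x2, y1, y2]
ax.legend()
```

In a notebook with `%matplotlib inline`, the figure is rendered when the cell finishes; in a plain script, call `plt.show()` at the end.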


### 9a

Draw a scatter plot with `age` and `Debt` on the axes. Mark clients without serious debts in blue (`SeriousDlqin2yrs = 0`) and debtors in red (`SeriousDlqin2yrs = 1`).

In [None]:
# place for code


### 9b
Plot two **normalized** distribution densities on the same graph: a red one for the monthly income of clients with arrears, and a blue one for the monthly income of clients without arrears. On the x-axis, display values up to 25,000.

In [None]:
# place for code


### 9c*
Visualize pairwise dependencies between the non-binary features `age`, `MonthlyIncome`, and `NumberOfDependents`. Limit monthly income to 25,000.

What patterns can you observe in the resulting charts?

*Functions that can be useful: `pd.plotting.scatter_matrix()`*

In [None]:
# place for code
