# The Basics
## What is Machine Learning?
---
### Machine Learning Overview
Machine learning is a way of taking data and turning it into insights. We use computer programs to __analyze__ examples from the past and __build__ a model that can predict the result for new examples. Machine learning can be used to create chatbots, detect spam, or recognize images.
One of the most common programming languages for machine learning is Python. A number of packages are useful for working with data and building machine learning models:
- Pandas is used for reading and manipulating data
- NumPy is used for computations on numerical data
- Matplotlib is used for graphing data
- Scikit-learn is used for machine learning models

### Course Contents
Machine learning is divided into supervised and unsupervised learning. __Supervised__ learning is when we have a known target based on past data, and __unsupervised__ learning is when there is no known past answer.

Machine learning problems can also be separated into classification and regression problems. __Regression__ is predicting a numeric value, whereas __classification__ is predicting what class something belongs to.
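
The difference between the two problem types can be seen in a minimal sketch. The data below is hypothetical and invented purely for illustration, and the nearest-neighbour lookup is a stand-in chosen for brevity, not one of the techniques covered in this course. The point is only that a regression prediction is a number, while a classification prediction is a class label.

```python
# Hypothetical toy data, invented for illustration:
# (hours studied, exam score, outcome)
records = [(1, 52, "fail"), (2, 58, "fail"), (4, 71, "pass"), (5, 80, "pass")]


def nearest(hours):
    """Return the past record whose hours value is closest to the query."""
    return min(records, key=lambda r: abs(r[0] - hours))


def predict_score(hours):
    # Regression: the prediction is a numeric value.
    return nearest(hours)[1]


def predict_outcome(hours):
    # Classification: the prediction is a class label.
    return nearest(hours)[2]


print(predict_score(4.4))    # 71
print(predict_outcome(4.4))  # pass
```

Both functions are supervised: they rely on past examples where the answer (score or outcome) is already known.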

This course focuses on __supervised classification__ problems. Our examples include:
- Predicting who would survive the sinking of the Titanic
- Determining a handwritten digit from an image
- Using biopsy data to classify if a lump is cancerous

We'll be using a number of techniques to tackle these problems:
- Logistic regression
- Decision trees
- Random forests
- Neural networks

## Statistics Review
---
### Average

In [1]:
ages = [15, 16, 18, 19, 22, 24, 29, 30, 34]
sum(ages) / len(ages)  # mean

23.0

In [2]:
ages[len(ages) // 2]  # median (works because ages is sorted and has an odd length)

22

_In statistics, both the **mean** and the **median** are called averages. In everyday usage, "average" refers to the mean._

### Percentiles
The median can be thought of as the __50th percentile__. This means that 50% of the data is less than the median and 50% of the data is greater than the median. Since this only tells us the middle of the data, we'll often also look at the __25th__ percentile and the __75th__ percentile.
The 25th percentile and the 75th percentile of `ages` are `18` and `29` respectively. The full range of the data is between 15 and 34, and half of the values lie between 18 and 29. This helps us understand the distribution of the data.
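
To show where `18` and `29` come from, here is a minimal sketch of a percentile calculation. The helper name `percentile` is hypothetical, and the sketch assumes the linear-interpolation convention (the default used by `np.percentile`); other conventions exist and can give slightly different results on small datasets.

```python
def percentile(values, q):
    """Return the q-th percentile (0 <= q <= 100) using linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted data.
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)                          # index just below the position
    upper = min(lower + 1, len(ordered) - 1)   # index just above, clamped to the end
    fraction = rank - lower
    # Interpolate between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


ages = [15, 16, 18, 19, 22, 24, 29, 30, 34]
print(percentile(ages, 25))  # 18.0
print(percentile(ages, 50))  # 22.0
print(percentile(ages, 75))  # 29.0
```

With nine values, the 25th, 50th, and 75th percentiles land exactly on the elements at indices 2, 4, and 6, so no interpolation is needed here.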

_If there is an even number of data points, to find the median, take the mean of the two values in the middle._
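
The rule for odd and even counts can be sketched as a small function. The name `median` is chosen for illustration; it sorts the input first, so it does not rely on the data already being in order.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)      # the median is defined on sorted data
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:     # odd count: the single middle value
        return ordered[mid]
    # Even count: the mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [15, 16, 18, 19, 22, 24, 29, 30, 34]
print(median(ages))       # 22
print(median(ages[:-1]))  # 20.5 (eight values: mean of 19 and 22)
```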

### Standard Deviation and Variance
The standard deviation and variance are measures of how dispersed, or spread out, the data is. Both are based on how far each data point is from the mean.

The distance of each value from the mean 23.0 is:
> 8, 7, 5, 4, 1, 1, 6, 7, 11

Squaring these distances and adding them together yields 362. Dividing that by the number of values (9) gives the variance, 40.22.

To get the standard deviation, take the square root of the variance: 6.34.
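
The arithmetic above can be reproduced step by step in plain Python. This is a minimal sketch with illustrative variable names. It divides by the number of values, which gives the population variance (the same convention `np.var` uses by default); the sample variance would divide by one less than that.

```python
import math

ages = [15, 16, 18, 19, 22, 24, 29, 30, 34]

mean = sum(ages) / len(ages)                    # 23.0
deviations = [abs(x - mean) for x in ages]      # distance of each value from the mean
squared_sum = sum(d ** 2 for d in deviations)   # 362.0
variance = squared_sum / len(ages)              # population variance: divide by n
std_dev = math.sqrt(variance)                   # square root of the variance

print(deviations)           # [8.0, 7.0, 5.0, 4.0, 1.0, 1.0, 6.0, 7.0, 11.0]
print(round(variance, 2))   # 40.22
print(round(std_dev, 2))    # 6.34
```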

### Statistics in Python
All of the measures above are implemented in the Python package NumPy.

In [4]:
import numpy as np

print("mean:", np.mean(ages))
print("median:", np.median(ages))
print("50th percentile (median):", np.percentile(ages, 50))
print("25th percentile:", np.percentile(ages, 25))
print("75th percentile:", np.percentile(ages, 75))
print("standard deviation:", np.std(ages))
print("variance:", np.var(ages))

mean: 23.0
median: 22.0
50th percentile (median): 22.0
25th percentile: 18.0
75th percentile: 29.0
standard deviation: 6.342099196813483
variance: 40.22222222222222


## Information
See [Sololearn Data Science](https://github.com/Edward-Ji/SololearnDataScience) for more content on data science libraries in Python.