# Chapter 01: The Machine Learning Landscape

This chapter covers some of the most important concepts of machine learning:
    1. What Machine Learning is and why we use it
    2. Different types of ML
        * Supervised Learning
        * Unsupervised Learning
        * Semi-supervised Learning
        * Reinforcement Learning
    3. Batch and Online Learning
    4. Instance-based and Model-based Learning
    5. Main challenges of Machine Learning
        * Bad Data
            > Insufficient quantity of training data
            > Nonrepresentative training data
            > Poor-quality data
        * Bad Algorithm
            > Overfitting the training data
            > Underfitting the training data
    6. Testing and Validation

### Example 1-1

In [None]:
# Import some common libraries
import pandas as pd
import numpy as np
import os

# to plot some pretty figures
import matplotlib as mpl
import matplotlib.pyplot as plt

# Ignore useless warnings 
import warnings
warnings.filterwarnings(action = 'ignore')

In [None]:
# Load the data
datapath = os.path.join("datasets", "lifesat", "")
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv",thousands=',',delimiter='\t',
                             encoding='latin1', na_values="n/a")

In [None]:
# Define a helper that merges the two datasets
def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Merge the GDP and life satisfaction data into a single pandas DataFrame
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Country", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]

In [None]:
# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

In [None]:
# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

In [None]:
# Select a linear model
import sklearn.linear_model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]

In [None]:
# Try a k-nearest neighbors regressor
from sklearn.neighbors import KNeighborsRegressor

In [None]:
model2 = KNeighborsRegressor(n_neighbors=5)

In [None]:
model2.fit(X,y)

In [None]:
# Predict with the k-nearest neighbors model (model2), not the linear model
print(model2.predict(X_new))

# Exercise Solutions

## 1. How would you define Machine Learning?
    
    Machine Learning is the science (and art) of programming computers so they can learn from data rather than being explicitly programmed.

## 2. Can you name 4 types of problems where it shines?
    
    Machine Learning lets us solve problems without hand-writing long lists of rules; instead, we train the machine with data. ML is great for:
        * problems whose existing solutions need a lot of hand-tuning or long lists of rules, e.g. a spam filter
        * complex problems for which there is no good traditional solution, e.g. image recognition
        * fluctuating environments, where the system must adapt to new data, e.g. a spam filter facing new kinds of spam
        * getting insights about complex problems and large amounts of data


## 3. What is a labeled training set?
    
    A labeled training set is a training set that contains the desired output (the label) for each instance.

## 4. What are the two most common supervised tasks?
    
    The two most common supervised tasks are classification (predicting a class) and regression (predicting a numeric target value).

## 5. Can you name 4 common unsupervised tasks?
    
    Four common unsupervised tasks are:
        1. Clustering
        2. Visualization
        3. Dimensionality reduction
        4. Association rule learning

## 6. What type of ML algorithm would you use to allow a robot to walk in various unknown terrains?
    Reinforcement Learning, where an agent learns a strategy by acting in an environment and receiving rewards or penalties.

## 7. What type of algorithm would you use to segment your customers into multiple groups?

    If the groups are not defined in advance, we can use a clustering algorithm (unsupervised learning). If the groups are known and the data is labeled with them, we can use a classification algorithm (supervised learning).
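
As a minimal sketch of the unlabeled case, the snippet below clusters a made-up customer table with scikit-learn's `KMeans`. The features (annual spend, visits per month), the synthetic data, and the choice of two clusters are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual spend, visits per month] (synthetic data)
rng = np.random.default_rng(42)
low_spenders = rng.normal(loc=[200, 2], scale=[20, 0.5], size=(50, 2))
high_spenders = rng.normal(loc=[1000, 10], scale=[20, 0.5], size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# No labels are given: KMeans groups customers by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
segments = kmeans.fit_predict(customers)

print(np.bincount(segments))  # number of customers in each segment
```

In practice the features would usually be scaled first, since K-Means is sensitive to feature scale.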

## 8. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
    
    Spam detection is a typical supervised learning problem, because the training set is labeled: each email is marked as spam or ham.

## 9. What is an online learning system?
    
    In online learning, the system is trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches.
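
A minimal sketch of incremental training, assuming scikit-learn's `SGDRegressor` and its `partial_fit` method. The synthetic stream, the batch size of 20, and the learning-rate settings are illustrative choices, not values from the chapter.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic stream: y = 3x + 2 plus a little noise (made-up data)
rng = np.random.default_rng(0)
X_stream = rng.uniform(0, 1, size=(1000, 1))
y_stream = 3 * X_stream[:, 0] + 2 + rng.normal(0, 0.05, size=1000)

model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=0)

# Feed the data in mini-batches of 20 instances; each call updates the model
batch_size = 20
for start in range(0, len(X_stream), batch_size):
    X_batch = X_stream[start:start + batch_size]
    y_batch = y_stream[start:start + batch_size]
    model.partial_fit(X_batch, y_batch)

print(model.coef_, model.intercept_)
```

Each mini-batch can be discarded once `partial_fit` has seen it, which is what makes this approach suitable for continuous data flows.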

## 10. What is out-of-core learning?
    
    Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (RAM). This is called out-of-core learning. The algorithm loads part of the data, runs a training step on that part, and repeats the process until it has run on all of the data.
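
The load-train-repeat loop can be sketched with the `chunksize` option of `pandas.read_csv`. This is only an illustration: the CSV file is a small synthetic stand-in written to a temporary directory, and the file name, chunk size, and model settings are assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor

# Write a synthetic CSV standing in for a dataset too large for RAM
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=2000)
data = pd.DataFrame({"x": x, "y": 4 * x + 1})
csv_path = os.path.join(tempfile.mkdtemp(), "big_dataset.csv")
data.to_csv(csv_path, index=False)

model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=1)
rows_seen = 0

# Only one chunk of 100 rows is held in memory at a time
for chunk in pd.read_csv(csv_path, chunksize=100):
    model.partial_fit(chunk[["x"]].to_numpy(), chunk["y"].to_numpy())
    rows_seen += len(chunk)

print(rows_seen, model.coef_, model.intercept_)
```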

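## 11. What type of learning algorithm relies on a similarity measure to make predictions?

    Instance-based learning algorithms rely on a similarity measure to make predictions: the system learns the training instances by heart, then generalizes to new instances by comparing them to the learned ones.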
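
To make the idea concrete, here is a small k-nearest neighbors regressor written from scratch with NumPy. The training data and the helper `knn_predict` are hypothetical, and Euclidean distance is just one possible similarity measure.

```python
import numpy as np

# Made-up training instances (one feature) and their target values
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0, 11.0])

def knn_predict(x_new, k=3):
    """Average the targets of the k training instances closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # similarity measure
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

prediction = knn_predict(np.array([2.5]))
print(prediction)  # 2.0: the mean of the targets of the instances at 1, 2 and 3
```

Note that nothing is "fitted" here: the training set itself is the model, and all the work happens at prediction time.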

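## 12. What is the difference between a model parameter and a learning algorithm's hyperparameter?

    A model parameter is a value the learning algorithm learns from the training data (e.g., the slope of a linear model); it determines what the model predicts for a new instance. A hyperparameter is a parameter of the learning algorithm itself, not of the model. It is set before training and stays constant during training (e.g., the amount of regularization to apply).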
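
A short sketch of the distinction, assuming scikit-learn's `Ridge` regressor on made-up data: `alpha` (the regularization strength) is a hyperparameter we choose, while `coef_` and `intercept_` are parameters the algorithm learns.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data that follows y = 2x exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# alpha is a hyperparameter: chosen before training, fixed during training
weak_reg = Ridge(alpha=0.01).fit(X, y)
strong_reg = Ridge(alpha=100.0).fit(X, y)

# coef_ and intercept_ are model parameters: learned from the data
print(weak_reg.coef_, weak_reg.intercept_)
print(strong_reg.coef_, strong_reg.intercept_)
```

With the larger `alpha`, the learned slope is pulled toward zero, which shows how a hyperparameter shapes the parameters that training ends up with.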

## 13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?

    Model-based learning algorithms search for the optimal values of the model parameters, so that the model generalizes well to new instances. We usually train such a system by minimizing a cost function that measures how badly the model predicts the training data. To make a prediction, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm.
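
The three steps (define a cost, minimize it, predict) can be sketched for a one-feature linear model. The data is made up, the names `theta0`, `theta1`, and `mse_cost` are illustrative, and ordinary least squares via `np.linalg.lstsq` is only one of several ways to minimize the cost.

```python
import numpy as np

# Made-up data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

def mse_cost(theta0, theta1):
    """Cost function: mean squared error of the line theta0 + theta1 * x."""
    return np.mean((theta0 + theta1 * x - y) ** 2)

# Find the parameters that minimize the cost (ordinary least squares)
A = np.c_[np.ones_like(x), x]
(theta0, theta1), *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict by plugging a new instance into the model function
x_new = 5.0
y_pred = theta0 + theta1 * x_new
print(theta0, theta1, y_pred, mse_cost(theta0, theta1))
```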

## 14. Can you name four of the main challenges in Machine Learning?

    * Insufficient quantity of training data
    * Non-representative training data
    * Poor quality data
    * Irrelevant features
    * Overfitting the training data
    * Underfitting the training data

## 15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?

    If the model performs well on the training data but poorly on new instances, it has most likely overfit the training data. Three possible solutions are:
    * Gather more training data.
    * Reduce the noise in the training data, e.g. fix data errors and remove outliers.
    * Simplify the model, e.g. select one with fewer parameters or regularize it.
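
A rough illustration of overfitting, assuming synthetic data with a noisy linear trend: a degree-9 polynomial always fits the 12 training points at least as well as a straight line, but it typically does worse on new points. The degrees, noise level, and sample sizes are arbitrary choices.

```python
import numpy as np

# Synthetic data: a noisy linear trend (made-up for illustration)
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 12)
y_train = 2 * x_train + rng.normal(0, 0.2, size=12)
x_new = np.linspace(0.04, 0.96, 50)
y_new = 2 * x_new + rng.normal(0, 0.2, size=50)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Compare a simple model (degree 1) with a much more flexible one (degree 9)
results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_new, y_new))
    print(degree, results[degree])  # (training MSE, MSE on new points)
```

A large gap between the two numbers for the flexible model is the usual symptom of overfitting.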

## 16. What is a test set and why would you want to use it?

    A test set is used to estimate the generalization error, i.e. the error rate the model will make on new instances, before the model is launched in production.
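
A minimal sketch of holding out a test set with scikit-learn's `train_test_split`. The synthetic data and the 80/20 split are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

# Hold out 20% of the instances; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# The error on the held-out instances estimates the generalization error
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(test_mse)
```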

## 17. What is the purpose of a validation set?

    A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
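
One way to sketch this workflow, assuming a `KNeighborsRegressor` whose `n_neighbors` hyperparameter is tuned on a held-out validation set. The synthetic data, split sizes, and candidate values are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: a noisy sine wave
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Split off the test set first, then carve a validation set out of the rest
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Compare candidate hyperparameter values on the validation set only
val_errors = {}
for k in (1, 5, 20):
    candidate = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    val_errors[k] = mean_squared_error(y_val, candidate.predict(X_val))
best_k = min(val_errors, key=val_errors.get)

# Retrain the winner on train + validation, then use the test set once
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_rest, y_rest)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(best_k, test_mse)
```

The test set is touched only once, at the very end, so its error remains an honest estimate.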

## 18. What can go wrong if you tune hyperparameters using the test set?

    If you tune hyperparameters using the test set, you risk overfitting the test set: the chosen hyperparameters are adapted to that particular set. The generalization error you measure will then be optimistic, and the launched model may perform worse than expected.

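## 19. What is cross-validation and why would you prefer it to a validation set?

    Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. The training set is split into complementary subsets, and each model is trained on a combination of these subsets and validated on the remaining part. This saves precious training data.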
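
A minimal sketch of comparing two models with 5-fold cross-validation, assuming scikit-learn's `cross_val_score`. The synthetic data, the two candidate models, and the number of folds are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# 5-fold CV: each model is trained 5 times, each time validated on a different fold
mean_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    mean_mse[name] = -scores.mean()  # scores are negated MSEs

print(mean_mse)
```

The price of not setting data aside is training time: every candidate is trained once per fold.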