![polina](https://raw.githubusercontent.com/PolinaRus/CIS_8795/master/polina.png)

# What is Machine Learning?
Machine learning is a set of techniques for handling large volumes of data intelligently (by developing algorithms or sets of logical rules) in order to derive actionable insights (in this notebook, predicting life expectancy).

# What are the steps used in Machine Learning?

There are 5 basic steps used to perform a machine learning task:

* **Collecting data:** Whether the raw data comes from Excel, Access, text files, or elsewhere, this step (gathering past data) forms the foundation of future learning. The greater the variety, density, and volume of relevant data, the better the machine's learning prospects.
* **Preparing the data:** Any analytical process depends on the quality of the data used. Spend time assessing data quality and then fixing issues such as missing values and outliers. Exploratory analysis is one way to study the nuances of the data in detail and so get more value out of it.
* **Training a model:** This step involves choosing an appropriate algorithm and a representation of the data in the form of a model. The cleaned data is split into two parts, train and test (the proportion depends on the requirements); the first part (training data) is used to develop the model. The second part (test data) is held back as a reference.
* **Evaluating the model:** To test accuracy, the second part of the data (holdout / test data) is used. This step shows how well the chosen algorithm performs, based on the outcome. The most reliable check of a model's accuracy is its performance on data that was not used at all while building the model.
* **Improving the performance:** This step might involve choosing a different model altogether or introducing more variables to improve accuracy. That is why a significant amount of time needs to be spent on data collection and preparation.
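
The split, train, and evaluate steps above can be sketched in plain Python. This is only an illustration: the `(x, y)` rows, the 80/20 split, the seed, and the "always predict the mean" model are assumptions made for the example, not part of the notebook that follows.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (x, y) pairs, invented purely for illustration.
rows = [(x, 2.0 * x + 1.0) for x in range(10)]
train, test = train_test_split(rows)

# "Train" the simplest possible model: always predict the mean training label.
mean_label = sum(y for _, y in train) / len(train)

# Evaluate on the held-out rows only (mean absolute error).
test_mae = sum(abs(y - mean_label) for _, y in test) / len(test)
print(len(train), len(test), test_mae)
```

Because the test rows never influence `mean_label`, the error measured on them is an honest estimate of how the model behaves on unseen data.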

# What are the types of Machine Learning algorithms?

#### Supervised Learning / Predictive models:
A predictive model, as the name suggests, is used to predict a future outcome based on historical data. Predictive models are normally given clear instructions from the beginning about what needs to be learnt and how it needs to be learnt. This class of learning algorithms is termed Supervised Learning.

For example, supervised learning is used when a marketing company is trying to find out which customers are likely to churn. It can also be used to predict the likelihood of perils such as earthquakes and tornadoes, with the aim of determining the Total Insurance Value. Some examples of algorithms used are Nearest Neighbour, Naïve Bayes, Decision Trees, and Regression.
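
To make the idea of learning from labelled examples concrete, here is a minimal sketch of a 1-nearest-neighbour classifier. The feature names (monthly usage hours, support tickets), the data points, and the labels are hypothetical and exist only for this example.

```python
def nearest_neighbour_predict(train, query):
    """Return the label of the training point closest to query (1-NN)."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda pair: squared_distance(pair[0], query))
    return label

# Hypothetical labelled examples: (monthly_usage_hours, support_tickets) -> outcome
train = [
    ((40.0, 0.0), "stays"),
    ((35.0, 1.0), "stays"),
    ((5.0, 4.0), "churns"),
    ((8.0, 6.0), "churns"),
]

print(nearest_neighbour_predict(train, (6.0, 5.0)))   # closest to (5.0, 4.0)
print(nearest_neighbour_predict(train, (38.0, 0.0)))  # closest to (40.0, 0.0)
```

The labels in `train` are the "supervision": the algorithm is told the right answer for past cases and reuses it for new ones.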


#### Unsupervised learning / Descriptive models:
Unsupervised learning is used to train descriptive models where no target is set and no single feature is more important than another. For example, a retailer may wish to find out which combinations of products customers tend to buy together most frequently. In the pharmaceutical industry, unsupervised learning may be used to identify which diseases are likely to occur along with diabetes. An example of an algorithm used here is the K-means clustering algorithm.
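
As a rough illustration of K-means, the sketch below runs the standard assign-then-update loop (Lloyd's algorithm) on a handful of 2-D points. The points, the starting centroids, and the fixed iteration count are assumptions chosen to keep the example small and deterministic.

```python
def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters

# Hypothetical points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the grouping emerges only from the distances between points.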

# Regression
Regression analysis is a statistical tool for investigating relationships between variables. Usually, the investigator seeks to examine the causal effect of one variable upon another. It attempts to determine the strength of the relationship between one dependent variable and one or more other changing variables, known as independent variables. The two basic types of regression are linear regression (also called simple linear regression) and multiple regression. Linear regression uses one independent variable to explain and/or predict the dependent variable, while multiple regression uses two or more independent variables to predict the outcome.

The general form of each type of regression is: 

* **Linear Regression:** Y = a + bX + u
* **Multiple Regression:** Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

Where:
* Y= the variable that we are trying to predict
* X= the variable that we are using to predict Y
* a= the intercept
* b= the slope
* u= the regression residual
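
For the single-variable form Y = a + bX + u, the ordinary least squares estimates have a closed form: b is the covariance of X and Y divided by the variance of X, and a is chosen so the line passes through the means. The sketch below computes both on a small made-up data set (the numbers are illustrative, not taken from the life expectancy file).

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / s_xx            # slope
    a = y_mean - b * x_mean    # intercept
    return a, b

# Hypothetical observations, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_simple_linear_regression(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # the u term for each row
print(a, b)
print(residuals)
```

With an intercept in the model, the residuals of a least squares fit sum to (numerically) zero, which is a quick sanity check on the result.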

## Load Data Using curl
* **Line 1:** Use **%sh** - this allows you to execute shell code in the notebook
* **Line 2:** Make a new directory called **life_pred** on the machine
* **Line 3:** Use **curl** to download the file and save it in the **life_pred** directory under the name **life_expectancy.csv**
* **Line 4:** Check that the file was downloaded and is listed in the directory

In [6]:
%sh 
mkdir -p life_pred
curl 'https://raw.githubusercontent.com/PolinaRus/CIS_8795/master/Life_Expectancy.csv' > life_pred/life_expectancy.csv
ls /databricks/driver/life_pred

### Check File Directory
* Use **%fs** to check file path, name, and size

In [8]:
%fs 
ls file:/databricks/driver/life_pred

### Create Dataframe
* **Lines 1-4:** Read the CSV file into a Dataframe **data**, treating the first row as a header and inferring the column types
* **Line 5:** Cache the Dataframe **data** for faster reuse
* **Line 6:** Count rows in the Dataframe **data**

In [10]:
data = spark.read.format("com.databricks.spark.csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("file:/databricks/driver/life_pred/life_expectancy.csv")
data.cache()  # Cache data for faster reuse
data.count()

Display Newly Created Dataframe **data**

In [12]:
display(data)

Print the schema of the dataframe **data**

In [14]:
data.printSchema()

Drop rows with missing values and count the rows again

In [16]:
data = data.dropna() 
data.count()

Create a temporary view so that the data can be queried with SQL from this notebook

In [18]:
data.createOrReplaceTempView("life_exp")

Display the newly created table **life_exp**

In [20]:
%sql select * from life_exp

In [21]:
%sql
select Population, Life_Expectancy from life_exp

In [22]:
%sql
select Labor_Force, Life_Expectancy from life_exp

In [23]:
%sql
select GDP, Life_Expectancy from life_exp

In [24]:
%sql
select Urbanization, Life_Expectancy from life_exp

In [25]:
%sql
select Literacy, Life_Expectancy from life_exp

In [26]:
%sql
select Below_Poverty_Line, Life_Expectancy from life_exp

In [27]:
%sql
select Median_Age, Life_Expectancy from life_exp

In [28]:
# Create DataFrame with just the data we want to run linear regression
df = spark.sql("select GDP, Life_Expectancy as label from life_exp")
display(df)

In [29]:
# VectorAssembler packs the listed input columns into a single vector column
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["GDP"],
    outputCol="features")
output = assembler.transform(df)
display(output.select("features", "label"))

In [30]:
# Import LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define LinearRegression algorithm
lr = LinearRegression()

# Fit 2 models, using different regularization parameters
modelA = lr.fit(output, {lr.regParam:0.0})
modelB = lr.fit(output, {lr.regParam:100.0})
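
The two fits above differ only in `regParam`. As a simplified illustration of what a regularization penalty does, the sketch below fits a one-feature ridge (L2-penalised) regression in closed form. It assumes an L2 penalty on the slope only and uses made-up data; Spark's actual objective scales the terms differently and standardizes features by default, so these numbers will not match `modelA` or `modelB`.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimise sum((y - a - b*x)^2) + lam * b^2; the intercept is not penalised."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / (s_xx + lam)    # the penalty inflates the denominator
    a = y_mean - b * x_mean
    return a, b

# Hypothetical observations, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a0, b0 = fit_ridge_1d(xs, ys, lam=0.0)       # no penalty: ordinary least squares
a100, b100 = fit_ridge_1d(xs, ys, lam=100.0)  # strong penalty: slope shrinks
print(a0, b0)
print(a100, b100)
```

With a large penalty the slope is pulled toward zero and the intercept moves toward the mean of the labels, so the fitted line becomes flatter.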

In [31]:
print(">>>> ModelA intercept: %r, coefficient: %r" % (modelA.intercept, modelA.coefficients[0]))

In [32]:
print(">>>> ModelB intercept: %r, coefficient: %r" % (modelB.intercept, modelB.coefficients[0]))

In [33]:
# Make predictions
predictionsA = modelA.transform(output)
display(predictionsA)

In [34]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(predictionsA)
print("ModelA: Root Mean Squared Error = " + str(RMSE))

In [35]:
predictionsB = modelB.transform(output)
RMSE = evaluator.evaluate(predictionsB)
print("ModelB: Root Mean Squared Error = " + str(RMSE))
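
Root mean squared error is the square root of the average squared difference between label and prediction. The sketch below computes it by hand on three made-up pairs (the values are illustrative only). Note that in the cells above the predictions are made on the same rows used for fitting, so the reported figures are training errors rather than holdout errors.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical labels and predictions, invented for illustration.
labels = [70.0, 65.0, 80.0]
predictions = [72.0, 64.0, 77.0]

# Squared errors are 4, 1 and 9, so the result is sqrt(14 / 3).
print(rmse(labels, predictions))
```

RMSE is expressed in the same units as the label, which makes it easy to read as a typical size of the prediction error.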

In [36]:
from pandas import DataFrame
from ggplot import *

GDP = output.rdd.map(lambda p: (p.features[0])).collect()
Life_Expectancy = output.rdd.map(lambda p: (p.label)).collect()
predA = predictionsA.select("prediction").rdd.map(lambda r: r[0]).collect()
predB = predictionsB.select("prediction").rdd.map(lambda r: r[0]).collect()

pydf = DataFrame({'GDP':GDP,'Life_Expectancy':Life_Expectancy,'predA':predA, 'predB':predB})

In [37]:
pydf

In [38]:
pp = ggplot(pydf, aes('GDP','Life_Expectancy'))  + \
    geom_point(color='blue') 
display(pp)


In [39]:
p = ggplot(pydf, aes('GDP','Life_Expectancy')) + \
    geom_point(color='blue') + \
    geom_line(pydf, aes('GDP','predA'), color='red') + \
    geom_line(pydf, aes('GDP','predB'), color='green') + \
    scale_x_log10() + scale_y_log10()
display(p)