# <center>Regression & Classification Session (with some other AI & Data Science content)</center>
## <center>TSTT Data Science Initiative</center>

<center>Presented by Rafael Guerrero</center>

<center>Navin Dookeram, Darren Ramsook, Mariella Rivas, Gabriela Sewdhan</center>

<br>
<br>
<br>
<br>
<br>
<br>

### In this tutorial, we will be covering:

<br>

1. Some Common Definitions in AI & Data Science



2. Regression <br>
    i) Simple Linear Regression <br>
    ii) Multiple Linear Regression <br>
    iii) Multiple Regression



3. Classification <br>
    i) Logistic Regression <br>
    ii) ??

<br>
<br>
<br>
<br>
<br>

### 1. Some Common Definitions in AI & Data Science:

<br>

**Artificial Intelligence (AI):** Artificial intelligence can be divided into two (2) separate categories or ideas:

1) **Artificial General Intelligence (AGI)** - AGI refers to machines that mimic all the tasks that humans can do and perhaps perform the tasks even better.

2) **Artificial Narrow Intelligence (ANI)** - ANI refers to machines that handle a single task. Examples of these include a smart speaker, a self-driving car, a recommender system or a system that identifies product defects in a production line.

<br>
Most of the progress these days has been in ANI rather than AGI, so you can rest easy knowing that there will be no robot invasions coming any time soon.

<br>
<br>

**Machine Learning** is a tool used in AI that gives computers the ability to learn without being explicitly programmed to do so. It can be classified under three (3) categories or procedures:

1) **Supervised Learning** - In supervised learning, the previous data that we have tells us the values or labels of the outputs for given inputs. The machine learns the input-to-output mapping from these past records and uses it to predict the outputs of new data. One use case example is an email spam filter which, given incoming emails as input, detects what is spam and what is not spam. The previous data would be emails that have already been classified as spam or not spam. Examples of supervised learning procedures include linear regression, multiple regression and logistic regression.

2) **Unsupervised Learning** - In unsupervised learning, the previous data that we have does not tell us the values or labels of the outputs for given inputs. We use the data to infer clusters or groups of data points that share similar properties and, in some cases, to detect anomalies. Examples of unsupervised learning procedures include K-means clustering, hierarchical clustering, DBSCAN clustering and anomaly detection.

3) **Reinforcement Learning** - In reinforcement learning, the machine (the agent) learns by interacting with an environment. It takes an action, receives a reward or penalty as feedback, and gradually adjusts its behaviour to maximise the total reward. Unlike supervised learning, there are no labelled input-output pairs to learn from; the feedback only arrives after an action has been taken. These procedures are applied to systems that play games like chess and backgammon, and even in robotics. The machine chooses what to do next, but there is uncertainty about the future until that action has been performed.

<br>
Supervised learning procedures are the most commonly used ones.
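<br>
The clustering idea described under unsupervised learning above can be made concrete with a small sketch. It is a minimal 1-D version of K-means written with NumPy, not a production implementation. The function name `kmeans_1d` and the customer spend figures are made up for illustration.

```python
import numpy as np

def kmeans_1d(values, k=2, n_iter=20):
    """Very small K-means for 1-D data; returns (centres, assignments)."""
    values = np.asarray(values, dtype=float)
    # Initialise the centres at evenly spaced quantiles of the data.
    centres = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        distances = np.abs(values[:, None] - centres[None, :])
        assignments = np.argmin(distances, axis=1)
        # Move each centre to the mean of the points assigned to it.
        for j in range(k):
            if np.any(assignments == j):
                centres[j] = values[assignments == j].mean()
    return centres, assignments

# Hypothetical monthly spend of six customers (no labels are given).
spend = [20, 25, 30, 200, 220, 210]
centres, groups = kmeans_1d(spend, k=2)
print(centres)  # one centre per group
print(groups)   # group index assigned to each customer
```

No output labels are supplied here: the algorithm only sees the spend values and infers the two groups from how close the values are to each other.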

<br>
<br>

**Quick Quiz 1:**
*Another example of a supervised learning procedure is pricing or valuing houses based on their features. What would you consider as the inputs and outputs in this scenario? You can give some examples for inputs.*

**Quick Quiz 2:**
*What is a use case example of an unsupervised learning procedure?*

<br>
<br>
The input data can be structured, unstructured or semi-structured. Structured data is data that is organised in tables, such as the data that you would see in relational databases. Unstructured data is data that is not organised in tables. Examples include images, audio records, video records and text records. Semi-structured data combines some elements of structured data with some elements of unstructured data.
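<br>
As a rough illustration of the three kinds of input data, the sketch below builds one small example of each. All of the values (the house table, the review text and the listing record) are invented for this example.

```python
import json
import pandas as pd

# Structured: rows and columns, as in a relational table.
houses = pd.DataFrame(
    {
        "bedrooms": [2, 3, 4],
        "size_sqft": [900, 1400, 2000],
        "price": [150000, 210000, 320000],
    }
)

# Unstructured: free text with no predefined fields.
review = "Lovely house, close to the beach, but the kitchen needs work."

# Semi-structured: named fields (structured) wrapping free text (unstructured).
listing = json.loads(
    '{"id": 17, "bedrooms": 3, "description": "Quiet street, recently renovated."}'
)

print(houses.dtypes)    # every column has a fixed type
print(review.split())   # text has to be processed before it can be tabulated
print(sorted(listing))  # field names are known, the description is free text
```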

<br>
<br>

**Quick Quiz 3:**
*Would you classify the content in the body of emails as structured or unstructured data?*


<br>
<br>

**Data Science** - According to Wikipedia, “data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.” Data science generally tests hypotheses about data in order to support business decisions. For instance, in assessing the value of houses you might notice that houses with more bedrooms (within a certain range) are more valuable than houses with fewer bedrooms, even when the size of the house is the same. After further analysis that accounts for the extra cost of building the additional bedrooms, a construction company can see what type of houses to build, a realtor will know how to value these houses, and so on.
<br>
<br>

**Data Mining** – According to Wikipedia, “data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.”
<br>
<br>

**Deep Learning or Neural Networks** – Deep learning is a family of machine learning methods built on artificial neural networks with many layers. It is effective in practice and has applications in each of the learning procedures.

Some neural networks that are used for supervised learning include:
- Convolutional Neural Networks (for natural language processing, and image and video recognition)
- Recurrent Neural Networks (for time-series, handwriting and speech recognition)

Some neural networks that are used for unsupervised learning include:
- Autoencoders
- Deep Belief Nets
- Hebbian Learning
- Generative adversarial networks
- Self-organizing map


<br>
<br>

**Sources:**
- Andrew Ng: AI For Everyone, Coursera (https://www.coursera.org/learn/ai-for-everyone/home/welcome)
- Krish Naik (https://www.youtube.com/watch?v=k2P_pHQDlp0)
- Wikipedia (https://en.wikipedia.org/wiki/Data_science) (https://en.wikipedia.org/wiki/Data_mining)




<br>
<br>


**Quick Quiz 1:**
*Another example of a supervised learning procedure is valuing houses based on their features. What examples would be considered as the inputs and outputs?*


<font color='green'>

**Answer 1:** Some inputs could be house features such as location rank, size of land in square feet, size of house in square feet, number of rooms, etc. All of these can be represented in tabular form. The outputs would be the price of the houses.

    
<br>

<font color='black'>

**Quick Quiz 2:**
*What is a use case example of an unsupervised learning procedure?*

<font color='green'>

**Answer 2:** For instance, you may have a group of animals without knowing which families they belong to. Based on the input properties that we have on them, such as whether they can fly, whether they can swim, type of skin, etc., they can be grouped into clusters. We can even define the number of clusters or groups for the machine learning algorithm to use. Another example could be grouping the customers of a company into various segments. The groups would depend on the features of the customers that are used.

<br>

<font color='black'>

**Quick Quiz 3:**
*Would you classify the content in the body of emails as structured or unstructured data?*

<font color='green'>

**Answer 3:** The text of emails would be considered as unstructured data. Emails as a whole can be considered as semi-structured data since there are some structured elements such as sender email address, receiver email address, date sent, time sent, subject, etc.

In [34]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline

Source: Udacity (https://www.youtube.com/watch?v=i04Pfrb71vk)


<font color='blue'>

## Simple Linear Regression
- Y = β<sub>0</sub> + β<sub>1</sub>x<sub>1</sub>
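
As a minimal sketch of how β<sub>0</sub> and β<sub>1</sub> can be estimated, the example below applies the closed-form least-squares formulas with NumPy. The house sizes and prices are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical data: house size (hundreds of sq ft) and price (thousands).
x = np.array([8.0, 10.0, 12.0, 15.0, 20.0])
y = np.array([112.0, 128.0, 153.0, 178.0, 231.0])

# Closed-form least-squares estimates for Y = beta0 + beta1 * x1.
x_mean, y_mean = x.mean(), y.mean()
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Predict the price of a new house of size 14 (hundreds of sq ft).
y_new = beta0 + beta1 * 14.0
print(beta0, beta1, y_new)
```

The slope β<sub>1</sub> is the estimated change in Y for a one-unit increase in x<sub>1</sub>, and β<sub>0</sub> is the predicted value of Y when x<sub>1</sub> is zero.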

## Logistic Regression
P(Y = 1 | X) = σ( β<sub>0</sub> + β<sub>1</sub>x<sub>1</sub> ), where σ(z) = 1 / (1 + e<sup>-z</sup>) is the logistic (sigmoid) function
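
Logistic regression conventionally passes the linear term through the logistic (sigmoid) function, whose output always lies between 0 and 1 and can therefore be read as a probability. The sketch below shows this with hand-picked, hypothetical coefficients; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen by hand for illustration only.
beta0, beta1 = -7.0, 1.0

gpa = np.array([4.2, 7.0, 9.0])
prob = sigmoid(beta0 + beta1 * gpa)  # P(Y = 1 | X) for each GPA
admit = prob >= 0.5                  # classify with a 0.5 threshold

print(prob)
print(admit)
```

With these coefficients the decision boundary sits at a GPA of 7.0, where the linear term is zero and the predicted probability is exactly 0.5.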

In [32]:
data = [[0, 4.2], [0, 5.1], [0, 5.5], [1, 8.2], [1, 9.0], [1, 9.1]]
pd.DataFrame(data, columns=["Admission", "GPA"])


| Admission | GPA |
| --- | --- |
| 0 | 4.2 |
| 0 | 5.1 |
| 0 | 5.5 |
| 1 | 8.2 |
| 1 | 9.0 |
| 1 | 9.1 |
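
One way to fit a logistic regression to the small table above is gradient descent on the cross-entropy cost. The sketch below is a minimal version under stated assumptions: the helper name `train_logistic`, the learning rate, the iteration count and the choice to standardise GPA are all illustrative. It returns a dictionary holding `costs` (recorded every 100 iterations) and `learning_rate`, which is one hypothetical way to produce a dictionary of the shape that a learning-curve plot would read.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(x, y, learning_rate=0.1, n_iter=2000):
    """Fit P(Y=1|x) = sigmoid(b0 + b1 * x_std) by gradient descent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardise the feature so that gradient descent behaves well.
    mean, std = x.mean(), x.std()
    xs = (x - mean) / std
    b0, b1 = 0.0, 0.0
    costs = []
    eps = 1e-12  # avoids log(0)
    for i in range(n_iter):
        p = sigmoid(b0 + b1 * xs)
        if i % 100 == 0:
            # Cross-entropy cost, recorded once per hundred iterations.
            cost = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
            costs.append(cost)
        # Gradient step on both coefficients.
        b0 -= learning_rate * np.mean(p - y)
        b1 -= learning_rate * np.mean((p - y) * xs)
    return {"costs": costs, "learning_rate": learning_rate,
            "b0": b0, "b1": b1, "mean": mean, "std": std}

gpa = [4.2, 5.1, 5.5, 8.2, 9.0, 9.1]
admission = [0, 0, 0, 1, 1, 1]

d = train_logistic(gpa, admission)
xs = (np.array(gpa) - d["mean"]) / d["std"]
predicted = (sigmoid(d["b0"] + d["b1"] * xs) >= 0.5).astype(int)
print(predicted)
print(d["costs"][0], d["costs"][-1])
```

Because the two classes in this table are perfectly separated by GPA, the cost keeps shrinking as training continues, so the fixed iteration count acts as the stopping rule.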

In [None]:
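# Plot learning curve (with costs)
# Assumes `d` is a dictionary from an earlier training step (not shown here)
# that holds the recorded 'costs' and the 'learning_rate' that was used.
costs = np.squeeze(d['costs'])
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate = " + str(d["learning_rate"]))
plt.show()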