# scikit-learn
***
## 1) Introduction

Scikit-learn is a popular Python library for machine learning. It integrates with other Python packages, including NumPy, SciPy, matplotlib and pandas [3]. Scikit-learn began as a Google Summer of Code project in 2007 by David Cournapeau and was further developed by Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of the French Institute for Research in Computer Science and Automation (Inria). The first public release was in January 2010 [1]. Industries from engineering to banking to social media depend heavily on machine learning in their day-to-day operations, and scikit-learn is one of the most commonly used tools for applying it.

## 2) Machine Learning
In traditional software, a human developer writes every line of code that instructs the system how to perform each task [7].

In machine learning, a model is instead trained on a large amount of data to perform a task. Whenever the model encounters a pattern similar to those it was trained on, it performs the task automatically, without external intervention. Put another way, in traditional programming the system is given data and a program and produces output, whereas in machine learning the system is given data and the expected output and produces the program [7].
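The contrast between the two approaches can be sketched in a few lines. The example is illustrative only: the Celsius-to-Fahrenheit task and the sample values are assumptions chosen for simplicity and do not come from the cited article. The function hard-codes the rule, while `LinearRegression` estimates an equivalent rule from input/output pairs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: the developer writes the rule by hand.
def celsius_to_fahrenheit(celsius):
    return 1.8 * celsius + 32.0

# Machine learning: only inputs and known outputs are supplied,
# and the model estimates the rule from them.
celsius = np.array([[-10.0], [0.0], [15.0], [30.0], [100.0]])  # inputs
fahrenheit = np.array([14.0, 32.0, 59.0, 86.0, 212.0])         # known outputs

learned = LinearRegression()
learned.fit(celsius, fahrenheit)

print("Learned slope:", learned.coef_[0])
print("Learned intercept:", learned.intercept_)
print("Hand-written rule for 20 C:", celsius_to_fahrenheit(20.0))
print("Learned rule for 20 C:", learned.predict(np.array([[20.0]]))[0])
```

Because the sample data follows the rule exactly, the learned slope and intercept should come out very close to 1.8 and 32. Real data is noisy, so a learned rule is usually only an approximation.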

Machine learning is a constantly evolving area of computer science, although it has been studied for decades. In 1959, Arthur Samuel, a pioneer in the field of machine learning (ML), defined it as the “field of study that gives computers the ability to learn without being explicitly programmed” [5]. In machine learning, algorithms are used to analyse data and, in doing so, look for patterns. The types of data analysed include, but are not limited to:

* Numbers 
* Words 
* Images 
* Clicks 

Through data analysis and statistics the machine learns and, based on past conditions, it can make predictions about the future [2]. Machine learning is interesting and powerful because anything that can be stored digitally can be fed into a machine learning algorithm. It is also very common: services such as Netflix and YouTube use it to recommend what to watch, and search engines such as Google and social media platforms such as Instagram and Facebook use it for advertising [4].

In all of these systems the machine collects data about the film genres that interest you, the music you listen to, the products you buy, the links you click and the posts you like or dislike. Based on this past behaviour, it can predict with some confidence which movie you would like to watch next, which song you would like to hear next, and where you will or will not spend your money [4].

### 2.1 Machine learning methods
Machine learning can be subdivided into two areas: supervised learning and unsupervised learning. The key difference between the two is that the data used in supervised learning is labelled, whereas the data used in unsupervised learning is unlabelled. Labelled data is data that includes the answer the machine learning model has to predict [6]. Figure 1 shows labelled data being used to train the machine: the training data was labelled as "cats", and when the trained model was presented with four animals, two of which were cats, it predicted correctly. The following sections discuss supervised and unsupervised learning in more detail.

![Supervised_machine_learning.png](attachment:Supervised_machine_learning.png)

**Figure 1: Labelled data for supervised learning**
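As a minimal sketch of learning from labelled data, the example below trains a k-nearest-neighbours classifier on a handful of made-up measurements. The features (weight in kg, ear length in cm), their values and the labels are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labelled training data: [weight_kg, ear_length_cm]
features = np.array([
    [4.0, 6.5], [3.5, 6.0], [5.0, 7.0],         # labelled "cat"
    [20.0, 11.0], [30.0, 12.5], [25.0, 12.0],   # labelled "dog"
])
labels = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

# Each prediction is a majority vote among the 3 closest training samples.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(features, labels)

# Unseen animals without labels: the model predicts the label.
new_animals = np.array([[4.2, 6.4], [27.0, 12.0]])
predictions = classifier.predict(new_animals)
print(predictions)
```

The labels are what make this supervised: the classifier can only predict "cat" or "dog" because those answers were supplied during training.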

#### 2.1.1 Supervised learning
Most practical machine learning applications use supervised learning.



##### Classification

* k nearest neighbors
* support vector machines
* logistic regression

##### Regression

* linear regression

#### 2.1.2 Unsupervised learning

##### Clustering
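Unsupervised learning works on data that has no labels. As a rough sketch, the example below uses k-means clustering to group a few made-up 2-D points. The points and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled data: two visibly separate groups of 2-D points.
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
    [8.0, 8.2], [8.3, 7.9], [7.9, 8.1],
])

# The number of clusters is chosen by the user; n_init and random_state
# are set explicitly so that the result is reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

print("Cluster assigned to each point:", cluster_ids)
print("Cluster centres:\n", kmeans.cluster_centers_)
```

No answers are supplied here, so the cluster numbers carry no meaning of their own. The algorithm only reports which points belong together.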

## Machine learning algorithms in scikit-learn

In [None]:
import matplotlib.pyplot as plt
from sklearn import datasets
import numpy as np
import pandas as pd

## Development
***
### Regression - Diabetes dataset (skip if analysing the Boston Housing dataset)

In [None]:
# Load dataset from scikit-learn dataset library
diabetes = datasets.load_diabetes()
print()
print('Dataset shape:',diabetes.data.shape)
print()
print('Diabetes labels shape:',diabetes.target.shape)
print()
print('Diabetes feature names:',diabetes.feature_names)
print()
print()
print('|------------------------------------------- DataSet Description -------------------------------------------|')
print()
print(diabetes.DESCR)

df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target
df.head()

# Feature matrix X and target vector Y used in the analysis below
X, Y = datasets.load_diabetes(return_X_y=True)

### Regression - Boston Housing dataset (skip if analysing the Diabetes dataset)

In [None]:
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions of scikit-learn.
boston = datasets.load_boston()
print()
print('Dataset shape:', boston.data.shape)
print()
print('Boston labels shape:', boston.target.shape)
print()
print('Boston feature names:', boston.feature_names)
print()
print()
print('|------------------------------------------- DataSet Description -------------------------------------------|')
print()
print(boston.DESCR)

# Build a DataFrame of the Boston features and append the target column
df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
df["target"] = boston.target
df.head()

# Feature matrix X and target vector Y used in the analysis below
X, Y = datasets.load_boston(return_X_y=True)

### Analysis of the chosen dataset starts here. Make sure the correct dataset is loaded!

In [None]:
X.shape, Y.shape

### Splitting the data

In [None]:
from sklearn.model_selection import train_test_split # import library

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
X_train.shape, Y_train.shape

In [None]:
X_test.shape, Y_test.shape
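To make the effect of `test_size=0.2` concrete, the sketch below splits a small synthetic dataset. The arrays `X_demo` and `y_demo` are hypothetical stand-ins for the real data. The `random_state` argument does not appear in the cell above. It is added here as an assumption about how one might make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 2 features each, and 100 targets.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# test_size=0.2 holds back 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```

Without a fixed `random_state` the rows are shuffled differently on every run, so the metrics reported later in the notebook will vary slightly between runs.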

### Linear Regression Model

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
model = linear_model.LinearRegression() # define the model
model.fit(X_train, Y_train) # train the model

### Predict the output

In [None]:
Y_pred = model.predict(X_test) # predict Y values using the X_test inputs

### Evaluate the model

In [None]:
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean squared error (MSE): %.3f' % mean_squared_error(Y_test, Y_pred))
print('Coefficient of determination (R^2): %.3f' % r2_score(Y_test, Y_pred))
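The two metrics printed above can be checked by hand. MSE is the mean of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes both with NumPy on a few hypothetical values (not taken from the diabetes or Boston data) and prints the scikit-learn results next to them.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: average of the squared residuals.
residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_hat))
print("R^2:", r2_manual, r2_score(y_true, y_hat))
```

A lower MSE means the predictions are closer to the true values. An R^2 of 1 is a perfect fit, and an R^2 of 0 is no better than always predicting the mean.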

## Scatter plots

In [None]:
import seaborn as sns

In [None]:
Y_test

In [None]:
Y_pred

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=Y_test, y=Y_pred)

# References
[4] https://www.technologyreview.com/2018/11/17/103781/what-is-machine-learning-we-drew-you-another-flowchart/

[5] https://www.udemy.com/course/scikit-learn-in-python-for-machine-learning-engineers/

[6] https://www.cloudfactory.com/data-labeling-guide#:~:text=What%20is%20labeled%20data%3F,machine%20learning%20model%20to%20predict.

[7] https://medium.com/analytics-vidhya/most-used-scikit-learn-algorithms-part-1-snehit-vaddi-7ec0c98e4edd


# End