# This notebook consists of challenging ML-related tasks.
## Topics covered  – Association Rules, PCA & KNN 


### * Task 1
> Write a Python program using Scikit-learn to split the iris dataset into 80% train
    data and 20% test data. Out of total 150 records, the training set will contain 120
    records and the test set contains 30 of those records. Train or fit the data into the
    model and calculate the accuracy of the model using the K Nearest Neighbour
    Algorithm. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

In [2]:
from sklearn import datasets
iris = datasets.load_iris(as_frame=True)

In [3]:
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [4]:
iris['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [5]:
iris['data'].head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [6]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2)

In [7]:
print("X_train = ", X_train.shape)
print("Y_train = ", Y_train.shape)
print("X_test = ", X_test.shape)
print("Y_test = ", Y_test.shape)

X_train =  (120, 4)
Y_train =  (120,)
X_test =  (30, 4)
Y_test =  (30,)


In [8]:
def KNN_classifier(k=3):
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, Y_train)

    predicted = knn.predict(X_test)
    accuracy = metrics.accuracy_score(Y_test, predicted)
    return accuracy

accuracy_KNN = KNN_classifier()
print ("Accuracy of KNN model with k = 3 is =", accuracy_KNN)

Accuracy of KNN model with k = 3 is = 0.9666666666666667
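For intuition about what `KNeighborsClassifier` does at prediction time, here is a minimal from-scratch sketch. It assumes Euclidean distance and an unweighted majority vote, which match the scikit-learn defaults. The helper `knn_predict` and the toy points are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups in 2-D.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # a
print(knn_predict(train_X, train_y, (5.1, 5.0)))  # b
```

Note that "fitting" here is just storing the data; all of the work happens when a prediction is requested.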


### * Task 2
> Further train or fit the model and calculate the performance for different
    values of k. (Use the Iris dataset from scikit-learn datasets.)

In [9]:
for i in range(1, 15):
    accuracy = KNN_classifier(i)
    print ("Accuracy of KNN model with k = %s is =" % i, accuracy)

Accuracy of KNN model with k = 1 is = 0.9666666666666667
Accuracy of KNN model with k = 2 is = 0.9333333333333333
Accuracy of KNN model with k = 3 is = 0.9666666666666667
Accuracy of KNN model with k = 4 is = 0.9666666666666667
Accuracy of KNN model with k = 5 is = 0.9666666666666667
Accuracy of KNN model with k = 6 is = 0.9666666666666667
Accuracy of KNN model with k = 7 is = 1.0
Accuracy of KNN model with k = 8 is = 1.0
Accuracy of KNN model with k = 9 is = 1.0
Accuracy of KNN model with k = 10 is = 1.0
Accuracy of KNN model with k = 11 is = 1.0
Accuracy of KNN model with k = 12 is = 1.0
Accuracy of KNN model with k = 13 is = 1.0
Accuracy of KNN model with k = 14 is = 0.9666666666666667
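With only 30 test samples, each misclassified flower moves the accuracy by 1/30 (about 0.033), so the differences between values of k above come down to one or two samples, and they will change between runs because the split is not seeded. A possible way to get a steadier comparison is k-fold cross-validation. The sketch below assumes 5 stratified folds and `random_state=0`; both are arbitrary choices, not settings used in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Fixed seed so the folds (and therefore the scores) are reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_accuracy = {}
for k in range(1, 15):
    # One accuracy score per fold; every sample is used for testing exactly once.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv)
    mean_accuracy[k] = scores.mean()
    print("k = %2d  mean accuracy = %.3f" % (k, mean_accuracy[k]))
```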


### * Task 3
>Further perform K-Means clustering on the above dataset and compare the
accuracy between the two algorithms

In [10]:
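from sklearn.cluster import KMeans
k_mean = KMeans(n_clusters=3)
# K-Means is unsupervised: Y_train is accepted for API consistency but ignored.
k_mean.fit(X_train, Y_train)

# Caution: predict() returns arbitrary cluster ids (0, 1, 2), not class labels,
# so this accuracy is only meaningful if the ids happen to match the classes.
predicted = k_mean.predict(X_test)
accuracy_k_mean = metrics.accuracy_score(Y_test, predicted)
print("K_Means accuracy for 3 cluster = %s, Whereas \n KNN accuracy for 3 clusters = %s" % (accuracy_k_mean, accuracy_KNN))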

K_Means accuracy for 3 cluster = 0.36666666666666664, Whereas 
 KNN accuracy for 3 clusters = 0.9666666666666667
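A caveat on this comparison: K-Means never sees the class labels, and the cluster ids it returns are assigned arbitrarily, so the 0.37 above may mostly reflect how the ids happened to line up with the class numbers. One common workaround is to map each cluster to the majority true label of its members before scoring. The sketch below shows the idea on made-up arrays; `map_clusters_to_labels` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def map_clusters_to_labels(cluster_ids, true_labels):
    """Return {cluster id: most common true label among its members}."""
    members = {}
    for cluster, label in zip(cluster_ids, true_labels):
        members.setdefault(cluster, []).append(label)
    return {c: Counter(labels).most_common(1)[0][0] for c, labels in members.items()}

# Made-up example: the grouping is perfect, but the ids are permuted.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [2, 2, 2, 0, 0, 0, 1, 1, 1]

raw_accuracy = sum(c == t for c, t in zip(cluster_ids, true_labels)) / len(true_labels)

mapping = map_clusters_to_labels(cluster_ids, true_labels)
relabelled = [mapping[c] for c in cluster_ids]
mapped_accuracy = sum(p == t for p, t in zip(relabelled, true_labels)) / len(true_labels)

print(raw_accuracy)     # 0.0
print(mapped_accuracy)  # 1.0
```

In the notebook's setting the mapping would be learned from the training clusters and then applied to the test predictions. A permutation-invariant score such as `sklearn.metrics.adjusted_rand_score` is another option.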


### * Task 4 - Theory

Difference between parametric and non-parametric learning algorithms

1. Parametric learning
    1. Assumes a fixed functional form for the mapping from inputs to outputs.
    2. Has a fixed number of parameters, regardless of how much training data is available.
    3. The predefined form differs for each algorithm.
    4. Only the values of the coefficients in the mapping function change based on the training data.
    5. These are generally simpler to understand and faster to train.
    6. Examples
        1. Logistic Regression
        2. Linear Regression
        3. Naive Bayes



2. Non-parametric learning
    1. They make few assumptions about the form of the mapping function.
    2. They do not have a fixed number of parameters.
    3. The number of parameters changes with the amount and nature of the input data.
    4. The values of the parameters are adjustable and change based on the input data.
    5. These are generally more complex and take longer to train and process.
    6. Examples
        1. Support Vector Machines
        2. KNN Classifier
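To make the "fixed number of parameters" distinction concrete, the sketch below fits a straight line by least squares, which always yields two numbers (slope and intercept), and contrasts it with a 1-nearest-neighbour regressor, which has to keep every training point. All function names and data here are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns the two parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance
    b = mean_y - m * mean_x
    return m, b

def fit_1nn(xs, ys):
    """A nearest-neighbour model 'trains' by memorising the data."""
    return list(zip(xs, ys))

def predict_1nn(model, x):
    # Return the target of the closest stored point.
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

for n in (5, 50):
    xs = [float(i) for i in range(n)]
    ys = [2.0 * x + 1.0 for x in xs]
    line = fit_line(xs, ys)
    memory = fit_1nn(xs, ys)
    print(n, "points -> line stores", len(line), "numbers; 1-NN stores", len(memory), "points")
```

The line model stays the same size however much data arrives, while the nearest-neighbour model grows with the training set.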
   


### * Task 5 - Theory

Supervised vs Unsupervised Learning

1. Supervised Learning
    1. The data is well labelled in training phase
    2. Algorithm creates a mapping function between input data (X) and output variable (Y)
    3. Supervised learning is mainly used in classification and regression problems.
    4. The algorithm uses the labelled training examples as reference points and predicts outputs for test data based on them.
    5. Supervised learning produces outputs for new data based on previous experience, i.e. the labelled examples.
    
    
    
2. Unsupervised learning
    1. Unsupervised learning is the training of a machine using information that is neither classified nor labeled.
    2. The algorithm acts on the information present in training data without guidance.
    3. The algorithm groups information according to similarities, patterns, and differences without any mapping function.
    4. Unsupervised learning is mainly used in clustering and association tasks.
    5. The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.

### * Task 6 - Theory
Steps involved in the machine learning process according to the KDnuggets website; this is in accordance with what was taught in class.

1. Data Collection:

>The quantity & quality of your data dictate how accurate our model is.<br>
The outcome of this step is generally a representation of data (Guo simplifies to specifying a table) which we will use for training.<br>
Using pre-collected data, by way of datasets from Kaggle, UCI, etc., still fits into this step.
 
 
2. Data Preparation:

>Wrangle data and prepare it for training.<br>
Clean that which may require it (remove duplicates, correct errors, deal with missing values, normalization, data type conversions, etc.).<br>
Randomize data, which erases the effects of the particular order in which we collected and/or otherwise prepared our data.<br>
Visualize data to help detect relevant relationships between variables or class imbalances (bias alert!), or perform other exploratory analysis.<br>
Split into training and evaluation sets.
 
3. Choose a Model:

>Different algorithms are for different tasks; choose the right one
 
4. Train the Model:

>The goal of training is to answer a question or make a prediction correctly as often as possible.<br>
Linear regression example: the algorithm would need to learn values for m (or W) and b (x is input, y is output).<br>
Each iteration of the process is a training step.
 
5. Evaluate the Model:

>Uses some metric or combination of metrics to "measure" objective performance of the model.<br>
Test the model against previously unseen data.<br>
This unseen data is meant to be somewhat representative of model performance in the real world, but still helps tune the model (as opposed to test data, which does not).<br>
Good train/eval split? 80/20, 70/30, or similar, depending on domain, data availability, dataset particulars, etc.
 
6. Parameter Tuning

>This step refers to hyperparameter tuning, which is an "artform" as opposed to a science.<br>
Tune model parameters for improved performance.<br>
Simple model hyperparameters may include: number of training steps, learning rate, initialization values and distribution, etc.
 
7. Make Predictions

>Further (test set) data, which have been withheld from the model until this point (and for which class labels are known), are used to test the model; this gives a better approximation of how the model will perform in the real world.
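
Step 4 mentions learning `m` and `b` for a linear regression, with each iteration counting as one training step. A minimal sketch of such a loop follows, assuming plain batch gradient descent on mean squared error. The learning rate, step count and data are arbitrary choices made for illustration; in the terms of step 6, the first two are hyperparameters.

```python
def train_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = m*x + b by batch gradient descent on mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):  # each pass through this loop is one training step
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to m and b.
        grad_m = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
m, b = train_linear(xs, ys)
print(round(m, 3), round(b, 3))  # approximately 2.0 1.0
```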

### * Task 7
> Perform dimensionality reduction using PCA on the US Arrests dataset
(enclosed herewith). What variance can be explained by PC1 & PC2?


In [11]:
import os
os.listdir()

['.ipynb_checkpoints',
 'Assignment_3.ipynb',
 'Nikhil Naik - apriori_data - apriori_data.csv',
 'Nikhil Naik - USArrests.csv']

In [12]:
us_arrest = pd.read_csv("Nikhil Naik - USArrests.csv")
# Remove the first column (the row labels) so that only numeric features remain.
target = us_arrest.pop(us_arrest.columns[0])


In [13]:
from sklearn.preprocessing import StandardScaler
scaled_data = StandardScaler().fit_transform(us_arrest)


In [14]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
x_pca = pca.fit_transform(scaled_data)

In [15]:
pca.explained_variance_

array([2.53085875, 1.00996444])

In [16]:
pca.explained_variance_ratio_  # PC1 captures about 62% of the variance and PC2 about 25% (about 87% together)

array([0.62006039, 0.24744129])
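
The two attributes shown above are linked: `explained_variance_` holds the eigenvalues of the covariance matrix of the scaled data, and `explained_variance_ratio_` divides each eigenvalue by the total variance. The NumPy sketch below illustrates this on synthetic data; the random matrix only stands in for the real table, so its numbers will differ from the notebook's. It also shows why the eigenvalues do not sum to the number of features: the scaling uses the population standard deviation while the covariance uses `n - 1`, so the total is `p * n / (n - 1)`. Assuming the usual 50-row, 4-feature USArrests table, that total is about 4.08, which is consistent with 2.5309 / 4.08 ≈ 0.62.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 4
# Synthetic correlated data standing in for the real table (an assumption).
raw = rng.normal(size=(n_samples, n_features)) @ rng.normal(size=(n_features, n_features))

# Same scaling as StandardScaler: zero mean, unit population std (ddof=0).
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Eigenvalues of the sample covariance matrix (ddof=1), largest first.
cov = np.cov(scaled, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio[:2])        # share captured by PC1 and PC2
print(explained_variance_ratio[:2].sum())  # cumulative share of the first two
print(eigenvalues.sum(), n_features * n_samples / (n_samples - 1))  # equal totals
```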

### * Task 8
>Create Basic association rule manually. The 'database' below has four
transactions. What association rules can be found in this set, if the minimum
support (i.e coverage) is 60% and the minimum confidence (i.e., accuracy) is 80%?

>Trans_id Item list<br><br>
T1 {K, A, D, B}<br>
T2 {D, A, C, E, B}<br>
T3 {C, A, B, E}<br>
T4 {B, A, D}<br>

 
Creating association rules - Manually

### **Transaction List**



| Transaction | Items |
|---|---|
| 1 | K, A, D, B |
| 2 | D, A, C, E, B |
| 3 | C, A, B, E |
| 4 | B, A, D |

-----------------------------------


#### All 1-item sets

| Item set | Frequency |
|---|---|
| K | 1 |
| A | 4 |
| D | 3 |
| C | 2 |
| E | 2 |
| B | 4 |


### Frequent item sets based on the minimum support level, i.e. 60%
> 60% of 4 transactions is 2.4, so an item set must appear in at least 3 transactions to cross the threshold.
> Hence A, B and D pass the first threshold.

#### Frequent 1-item sets

| Item set | Frequency |
|---|---|
| A | 4 |
| B | 4 |
| D | 3 |


-----------------------------------
#### Candidate 2-item sets (built from A, B, D)

| Item set | Frequency |
|---|---|
| A, B | 4 |
| A, D | 3 |
| B, D | 3 |

> **All three pass the minimum support threshold**

--------------------------------------------

#### Candidate 3-item set

| Item set | Frequency |
|---|---|
| A, B, D | 3 |

> **It passes the minimum support threshold.**<br>
> Hence I = {A,B,D}<br>
> Support(I) = 3/4 = 75%

### All non-empty proper subsets of {A,B,D}
>A, B, D, AB, AD, BD<br>
>For a rule S -> (I - S), the confidence is given by Support(I) / Support(S)

#### Rule 1, S = {A}
>Confidence A -> (B,D) = 3 / 4 = 75%
**Does not pass**

#### Rule 2, S = {B}
> Confidence B -> (A,D) = 3 / 4 = 75%
 **Does not pass**

#### Rule 3, S = {D}
> Confidence D -> (A,B) = 3 / 3 = 100%
 **RULE PASSES MINIMUM CONFIDENCE**
 
#### Rule 4, S = {A,B}
> Confidence A,B -> (D) = 3 / 4 = 75%
 **Does not pass**
 
#### Rule 5, S = {A,D}
> Confidence A,D -> (B) = 3 / 3 = 100%
 **RULE PASSES MINIMUM CONFIDENCE**
 
#### Rule 6, S = {B,D}
> Confidence B,D -> (A) = 3 / 3 = 100%
 **RULE PASSES MINIMUM CONFIDENCE**
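
The manual derivation above can be cross-checked with a short brute-force sketch. It assumes the same four transactions and thresholds, and simply tries every item set and every split into antecedent and consequent; this is exhaustive enumeration rather than the pruned Apriori search. Because it considers every frequent item set, it also lists rules derived from the frequent 2-item sets (for example A -> B), which the hand derivation above does not enumerate.

```python
from itertools import combinations

transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.6, 0.8

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
# Every item set, of any size, that meets the minimum support.
frequent = [
    combo
    for size in range(1, len(items) + 1)
    for combo in combinations(items, size)
    if support(combo) >= MIN_SUPPORT
]

rules = []
for itemset in frequent:
    # Split the item set into every non-empty antecedent / consequent pair.
    for size in range(1, len(itemset)):
        for antecedent in combinations(itemset, size):
            consequent = tuple(i for i in itemset if i not in antecedent)
            confidence = support(itemset) / support(antecedent)
            if confidence >= MIN_CONFIDENCE:
                rules.append((antecedent, consequent, confidence))

for antecedent, consequent, confidence in rules:
    print(",".join(antecedent), "->", ",".join(consequent),
          "confidence = %.0f%%" % (confidence * 100))
```

Run on this database, the sketch reports seven rules: A -> B, B -> A, D -> A, D -> B, D -> A,B, A,D -> B and B,D -> A, each with 100% confidence.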

### Task 9
>Create association rules using the Apriori algorithm on the dataset ('apriori
data') attached herewith.


In [17]:
import csv
# Each CSV row is one transaction: a list of item names.
with open("Nikhil Naik - apriori_data - apriori_data.csv") as csv_file:
    data = list(csv.reader(csv_file))

In [18]:
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder() 

In [19]:
transformed = te.fit(data).transform(data)

In [20]:
final_data = pd.DataFrame(transformed, columns=te.columns_)

In [21]:
final_data

Unnamed: 0,apples,artichok,avocado,baguette,bordeaux,bourbon,chicken,coke,corned_b,cracker,ham,heineken,hering,ice_crea,olives,peppers,sardines,soda,steak,turkey
0,True,False,False,False,False,True,True,True,True,True,False,False,False,False,False,True,False,False,False,False
1,False,False,False,True,False,True,False,True,False,False,True,False,False,True,True,False,False,False,False,True
2,False,False,False,False,False,True,False,False,True,False,True,False,True,False,True,True,False,False,False,True
3,True,False,True,True,False,True,False,False,False,False,False,False,False,True,False,True,True,False,False,False
4,True,False,False,True,False,False,False,False,False,True,False,True,True,False,False,True,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,False,False,True,False,False,True,False,False,False,True,False,True,False,False,True,False,False,True,False,False
997,True,False,True,True,False,False,False,False,False,False,False,False,False,True,True,True,True,False,False,False
998,False,False,False,False,False,False,True,True,True,False,False,True,False,True,False,True,True,False,False,False
999,True,False,False,True,False,False,False,False,True,False,False,False,True,False,True,True,False,False,True,False


In [22]:
from mlxtend.frequent_patterns import apriori, association_rules
def create_association_rules(support=0.1, confidence=0.01):
    """Mine frequent item sets at `support`, then keep rules with at least `confidence`."""
    freq_items = apriori(final_data, min_support=support, use_colnames=True)
    rules = association_rules(freq_items, metric="confidence", min_threshold=confidence)
    return rules, freq_items


In [23]:
rules, freq_items = create_association_rules()

In [24]:
freq_items

Unnamed: 0,support,itemsets
0,0.313686,(apples)
1,0.304695,(artichok)
2,0.362637,(avocado)
3,0.391608,(baguette)
4,0.402597,(bourbon)
...,...,...
214,0.102897,"(ham, turkey, hering, olives)"
215,0.114885,"(heineken, baguette, soda, hering, cracker)"
216,0.105894,"(olives, bourbon, soda, heineken, cracker)"
217,0.115884,"(ice_crea, coke, sardines, chicken, heineken)"


In [25]:
rules


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(apples),(avocado),0.313686,0.362637,0.138861,0.442675,1.220710,0.025107,1.143611
1,(avocado),(apples),0.362637,0.313686,0.138861,0.382920,1.220710,0.025107,1.112196
2,(apples),(baguette),0.313686,0.391608,0.146853,0.468153,1.195462,0.024011,1.143922
3,(baguette),(apples),0.391608,0.313686,0.146853,0.375000,1.195462,0.024011,1.098102
4,(apples),(corned_b),0.313686,0.390609,0.150849,0.480892,1.231132,0.028320,1.173918
...,...,...,...,...,...,...,...,...,...
1179,(ham),"(olives, turkey, corned_b, hering)",0.304695,0.111888,0.101898,0.334426,2.988934,0.067806,1.334355
1180,(turkey),"(ham, corned_b, hering, olives)",0.282717,0.117882,0.101898,0.360424,3.057495,0.068571,1.379223
1181,(olives),"(ham, turkey, corned_b, hering)",0.472527,0.101898,0.101898,0.215645,2.116279,0.053748,1.145019
1182,(corned_b),"(ham, turkey, hering, olives)",0.390609,0.102897,0.101898,0.260870,2.535247,0.061706,1.213727
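
The extra columns in the rules table follow directly from the three support values. As a check, the sketch below recomputes them for row 0, (apples) -> (avocado), using the standard definitions as given in the mlxtend documentation. The inputs are copied from the table, so the results agree only up to the rounding shown there.

```python
# Values copied from row 0 of the rules table: (apples) -> (avocado).
antecedent_support = 0.313686  # support(apples)
consequent_support = 0.362637  # support(avocado)
rule_support = 0.138861        # support(apples and avocado together)

# How often the consequent appears when the antecedent does.
confidence = rule_support / antecedent_support
# Confidence relative to the consequent's baseline frequency (1 = independent).
lift = confidence / consequent_support
# Observed co-occurrence minus what independence would predict (0 = independent).
leverage = rule_support - antecedent_support * consequent_support
# Ratio of expected to observed frequency of the rule being wrong.
conviction = (1 - consequent_support) / (1 - confidence)

print("confidence = %.4f" % confidence)
print("lift       = %.4f" % lift)
print("leverage   = %.4f" % leverage)
print("conviction = %.4f" % conviction)
```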


### Task 10

>Show how the count of rules varies when the Support & Confidence
thresholds are changed.


In [26]:
import random

# Try 16 random (support, confidence) pairs and count the resulting rules.
for _ in range(16):
    support_pct = random.randint(5, 25)     # minimum support in percent
    confidence_pct = random.randint(1, 15)  # minimum confidence in percent
    rules, freq_items = create_association_rules(support_pct / 100, confidence_pct / 100)
    print("Total rules when support = %s%% and confidence = %s%% is %s" % (support_pct, confidence_pct, rules.shape[0]))
    

Total rules when support = 12% and confidence = 1% is 484
Total rules when support = 10% and confidence = 12% is 1184
Total rules when support = 21% and confidence = 15% is 46
Total rules when support = 10% and confidence = 14% is 1184
Total rules when support = 13% and confidence = 12% is 278
Total rules when support = 21% and confidence = 11% is 46
Total rules when support = 16% and confidence = 8% is 98
Total rules when support = 12% and confidence = 11% is 484
Total rules when support = 23% and confidence = 1% is 32
Total rules when support = 9% and confidence = 8% is 1700
Total rules when support = 20% and confidence = 11% is 60
Total rules when support = 25% and confidence = 8% is 14
Total rules when support = 14% and confidence = 7% is 182
Total rules when support = 8% and confidence = 3% is 1778
Total rules when support = 6% and confidence = 8% is 1918
Total rules when support = 18% and confidence = 14% is 72
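
In the run above, pairs with the same support give the same rule count whatever the confidence (for example 484 rules at 12% support for both 1% and 11% confidence), which suggests that confidence thresholds of 1% to 15% filter nothing on this data. A deterministic grid with a wider confidence range makes both effects easier to see. The sketch below illustrates the idea on the four-transaction toy database from Task 8 rather than the CSV, using a hypothetical brute-force `count_rules` helper instead of mlxtend.

```python
from itertools import combinations

# Four-transaction toy database (the one from Task 8), not the CSV used above.
transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def count_rules(min_support, min_confidence):
    """Count rules whose full item set is frequent and whose confidence passes."""
    total = 0
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            s = support(itemset)
            if s == 0 or s < min_support:
                continue
            # Every non-empty proper subset is a possible antecedent.
            for r in range(1, size):
                for antecedent in combinations(itemset, r):
                    if s / support(antecedent) >= min_confidence:
                        total += 1
    return total

supports = [0.25, 0.5, 0.75, 1.0]
confidences = [0.5, 0.8, 1.0]
counts = {(s, c): count_rules(s, c) for s in supports for c in confidences}

print("support  " + "  ".join("conf>=%.1f" % c for c in confidences))
for s in supports:
    print("%7.2f  " % s + "  ".join("%9d" % counts[(s, c)] for c in confidences))
```

Reading across a row shows the effect of confidence at fixed support, and reading down a column shows the effect of support at fixed confidence; the count can only stay the same or fall as either threshold rises.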
