# Week 19 Group Activity

## 1. Take one of the supervised learning models you have built recently and apply at least three dimensionality reduction techniques to it (separately). Be sure to create a short summary of each technique you use. Indicate how each changed the model performance. 
Reference:
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/

In [23]:
# Dependencies and Modules

import numpy as np
import pandas as pd
from sklearn import metrics

import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# diabetes csv:
diabetes_df = pd.read_csv("diabetes.csv")

# Separate data into input and output variables
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

### Linear Discriminant Analysis

In [22]:
# define the pipeline: LDA projection followed by logistic regression
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# Report
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Accuracy: 0.772 (0.043)


This score is already better than what we got in class with just LogisticRegression. Linear Discriminant Analysis is a supervised technique: it uses the class labels to find the projection that best separates the classes. It can produce at most n_classes - 1 components, so with the binary Outcome target the projection is limited to a single dimension (n_components=1).
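The component limit can be checked directly. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes data (the names `X_demo` and `y_demo` are illustrative, not from the notebook) and shows that a two-class problem projects down to one column. It assumes a recent scikit-learn version, where asking for more components than n_classes - 1 raises a `ValueError`.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in: binary target, 8 input features (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0,
    n_classes=2, random_state=0,
)

# With 2 classes, LDA can keep at most 2 - 1 = 1 component
lda = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda.fit_transform(X_demo, y_demo)
print(X_proj.shape)

# Requesting 2 components for a 2-class problem should be rejected
try:
    LinearDiscriminantAnalysis(n_components=2).fit(X_demo, y_demo)
    too_many_accepted = True
except ValueError:
    too_many_accepted = False
print(too_many_accepted)
```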

### Principal Component Analysis

In [18]:
# evaluate pca with logistic regression for classification

# define the pipeline
steps = [('pca', PCA(n_components=8)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Accuracy: 0.773 (0.043)


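PCA is one of the most widely used dimensionality reduction techniques. It is unsupervised: it projects the data onto the directions of greatest variance, and the parameter n_components sets how many of those directions to keep. Note that if the dataset has eight input features, as the standard diabetes.csv with an Outcome column does, n_components=8 keeps every direction and nothing is actually reduced. The score (0.773) is essentially the same as the LDA result (0.772), and the two runs used different cross-validation seeds, so the difference is within noise.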
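One way to choose n_components is to look at the cumulative explained variance and keep the fewest components that reach a target. The sketch below does this on synthetic data (a stand-in for the diabetes features; `X_demo`, `n_keep`, and the 95% target are illustrative choices). Features are standardized first because PCA is sensitive to feature scale.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features, some of them redundant (assumption)
X_demo, _ = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=3,
    random_state=0,
)

# Standardize, then fit PCA with all components kept
X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_scaled)

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches the 95% target
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print(n_keep)
```

The resulting `n_keep` could then be passed as `PCA(n_components=n_keep)` inside the pipeline.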

### Isomap

In [19]:
# evaluate isomap with logistic regression algorithm for classification

# define the pipeline
steps = [('iso', Isomap(n_components=8)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Accuracy: 0.725 (0.048)


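Isomap is a nonlinear technique that tries to preserve the geodesic distances between points, measured along a nearest-neighbor graph. Here it lowered accuracy to 0.725, roughly 0.05 below the LDA and PCA results.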
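Because Isomap builds its neighbor graph from distances between rows, feature scale and the n_neighbors setting can both affect the result. The sketch below is one possible variation, not the configuration used above: it adds a `StandardScaler` step before Isomap and actually reduces the dimension to 2. The data is synthetic, and the values n_neighbors=10 and n_components=2 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import Isomap
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a binary problem with 8 features (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0,
    random_state=0,
)

# Scale first so no single feature dominates the neighbor distances
steps = [
    ('scale', StandardScaler()),
    ('iso', Isomap(n_neighbors=10, n_components=2)),
    ('m', LogisticRegression()),
]
model = Pipeline(steps=steps)

# 5-fold cross-validated accuracy of the whole pipeline
scores = cross_val_score(model, X_demo, y_demo, scoring='accuracy', cv=5)
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
```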

## 2. Write a function that indicates whether an input IPv4 address is valid. An IP address is valid if it has 4 values between 0 and 255 (inclusive), separated by periods.
**Input 1:**\
2.33.245.5\
**Output 1:**\
True

**Input 2:**\
12.345.67.89\
**Output 2:**\
False

In [14]:
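# define function
def accurate(ip):
    val = str(ip).split('.')  # split the address on the periods
    if len(val) != 4:  # a valid address has exactly four parts
        return False
    for i in val:
        # each part must be all digits (no signs, letters, or empty parts)
        # and no larger than 255; isdigit() already rules out negatives
        if not (i.isascii() and i.isdigit()) or int(i) > 255:
            return False
    return True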

In [15]:
accurate('2.33.245.5')

True

In [24]:
accurate('3.55.2343.7')

False
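The two calls above only cover a valid address and an out-of-range value. The sketch below adds a few more edge cases, such as non-numeric parts, empty parts, and the wrong number of parts. It uses a self-contained helper with a hypothetical name, `is_valid_ipv4`, that follows the same rule as the task statement. Leading zeros (for example `01.2.3.4`) are accepted here, since the task does not say how to treat them.

```python
def is_valid_ipv4(ip):
    """Return True if ip has four period-separated values in 0..255."""
    parts = str(ip).split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        # Reject empty parts, signs, letters, and non-ASCII digits
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True


# Edge cases paired with the result expected under the task's rule
cases = {
    '2.33.245.5': True,
    '12.345.67.89': False,      # 345 is out of range
    '0.0.0.0': True,            # lower bound is inclusive
    '255.255.255.255': True,    # upper bound is inclusive
    '1.2.3': False,             # too few parts
    '1.2.3.4.5': False,         # too many parts
    '1..2.3': False,            # empty part
    'a.b.c.d': False,           # non-numeric parts
    '-1.2.3.4': False,          # negative value
}

for address, expected in cases.items():
    assert is_valid_ipv4(address) == expected, address
print('all cases passed')
```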