# InClass Group Activity, Week 19, Heather Leighton-Dick

## 1. Take one of the supervised learning models you have built recently and apply at least three dimensionality reduction techniques to it (separately). Be sure to create a short summary of each technique you use. Indicate how each changed the model performance. Reference: https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/


In [1]:
import pandas as pd
import numpy as np

diabetes_df = pd.read_csv("../Homework14/diabetes.csv")
diabetes_df.head()

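| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |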


In [47]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1).values
y = diabetes_df['Outcome'].values

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

### PCA

In [55]:
# evaluate pca with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# define the pipeline: PCA followed by logistic regression
steps = [('pca', PCA(n_components=8)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)

# fit on the training split and predict the held-out split
# (y_pred is not used by the cross-validation below)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# evaluate model with repeated stratified k-fold on the full X, y
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance as mean (standard deviation) across folds
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.775 (0.041)


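Principal Component Analysis (PCA) projects the features onto a set of orthogonal directions (the principal components), ordered so that the first component captures the greatest share of the variance among the samples, the second the next greatest, and so on. With n_components=8 the pipeline keeps as many components as the dataset appears to have feature columns, so the data is rotated rather than reduced. Results: reducing n_splits and changing the model from LogisticRegression to GaussianNB both reduced the model's accuracy. Adding a train-test split didn't change the accuracy, because cross_val_score is still given the full X and y and refits the pipeline on its own folds.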
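
To see PCA actually discard dimensions, here is a minimal sketch that scales the features and keeps 4 of 8 components. It runs on synthetic data from `make_classification`, which is an assumption standing in for diabetes.csv; the names `X_demo`, `y_demo`, and `pca_pipeline` and the choice of 4 components are illustrative only, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (assumption: 8 numeric columns).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=2, random_state=42
)

# Scale first so no single large-valued column dominates the variance,
# then keep only the 4 leading principal components.
pca_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=4)),
    ('m', LogisticRegression()),
])
pca_pipeline.fit(X_demo, y_demo)

# Share of the total variance carried by each kept component.
explained = pca_pipeline.named_steps['pca'].explained_variance_ratio_
# Output of the scale + pca steps only (the classifier is sliced off).
reduced = pca_pipeline[:-1].transform(X_demo)

print('Reduced shape:', reduced.shape)
print('Variance kept: %.3f' % explained.sum())
```

The printed variance total gives a rough sense of how much information the dropped components carried, which helps when choosing n_components.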

### LDA

In [54]:
# evaluate lda with gaussian naive bayes algorithm for classification
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# define the pipeline: LDA followed by Gaussian naive Bayes
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', GaussianNB())]
model = Pipeline(steps=steps)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.772 (0.042)


Linear Discriminant Analysis (LDA) is a supervised technique that works for binary as well as multiclass classification. It looks for a linear combination of the inputs that maximizes the distance between classes of samples while minimizing the distance between samples within each class. It can produce at most one fewer component than there are classes, which is why n_components=1 is used for this two-class problem. Results: reducing n_splits from 10 to 5 lowered the accuracy score by a hundredth, but changing the model from LogisticRegression to GaussianNB increased the accuracy from 0.65 to 0.772. Incorporating a train-test split didn't affect the accuracy either way, since cross_val_score is run on the full X and y.
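
The number of components LDA can return depends on the number of classes. The sketch below illustrates this on synthetic data from `make_classification` (an assumption, not the diabetes data); `lda_output_width` is a hypothetical helper written for this example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def lda_output_width(n_classes):
    """Fit LDA with default n_components and return the width of its output."""
    X_demo, y_demo = make_classification(
        n_samples=300, n_features=8, n_informative=4,
        n_classes=n_classes, random_state=42,
    )
    lda = LinearDiscriminantAnalysis()
    return lda.fit_transform(X_demo, y_demo).shape[1]


print(lda_output_width(2))  # two classes -> 1 component
print(lda_output_width(3))  # three classes -> 2 components
```

For a binary target such as Outcome, LDA therefore compresses all the inputs into a single discriminant score.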

### Modified LLE (MLLE)

In [50]:
# evaluate modified lle with gaussian naive bayes for classification
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.naive_bayes import GaussianNB

# define the pipeline: modified LLE followed by Gaussian naive Bayes
steps = [('lle', LocallyLinearEmbedding(n_components=8, method='modified', n_neighbors=10)), ('m', GaussianNB())]
model = Pipeline(steps=steps)

#model.fit(X_train, y_train)
#y_pred = model.predict(X_test)

# evaluate model
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.721 (0.027)


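Locally Linear Embedding (LLE), used in the cell above, is a manifold learning technique: it describes each sample as a weighted combination of its nearest neighbors and then looks for a lower-dimensional layout that preserves those local weights. The modified variant (method='modified') uses multiple weight vectors in each neighborhood to make the result more stable.

Feature embedding, as used in neural networks, is a related but different idea. It groups like values within a sparse feature and returns their locations, which has the effect of concentrating the importance of that feature. A sparse feature can be a categorical feature consisting of lots of 0's and a few 1's (as in the "get dummies" transformation or one-hot encoding), or it could be a feature made up of fairly unique values (ID numbers, for example). When embedding is used on structured data in a neural network, it is usually implemented in the first hidden layer alongside the "normal" features.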
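
As a rough illustration of what `LocallyLinearEmbedding(method='modified')` does, the sketch below unrolls a synthetic Swiss roll from three dimensions to two. The Swiss roll data, the names `X_roll`, `mlle`, and `X_flat`, and the choice of `eigen_solver='dense'` are assumptions made for this example; none of them come from the diabetes pipeline.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic 3-D points lying on a rolled-up 2-D sheet.
X_roll, _ = make_swiss_roll(n_samples=300, random_state=42)

# Modified LLE needs n_neighbors >= n_components; the dense solver is
# chosen here to keep this small example deterministic.
mlle = LocallyLinearEmbedding(
    n_components=2, method='modified', n_neighbors=10, eigen_solver='dense'
)
X_flat = mlle.fit_transform(X_roll)

print(X_roll.shape, '->', X_flat.shape)
```

Each point keeps roughly the same relationship to its 10 nearest neighbors in the flattened output, which is the property LLE tries to preserve.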

Changing n_splits didn't have any effect, but changing the model to NaiveBayes boosted accuracy from 0.652 to 0.722. Adding a train-test-split pulled the accuracy score down.
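
For the train-test split to influence the reported score, the cross-validation has to run on the training portion only, with the test portion scored once at the end. A minimal sketch of that workflow follows, on synthetic data from `make_classification` (an assumption standing in for diabetes.csv); `demo_model`, `X_tr`, `X_te`, and the other names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import (
    RepeatedStratifiedKFold, cross_val_score, train_test_split,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: 8 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_model = Pipeline(steps=[
    ('lda', LinearDiscriminantAnalysis(n_components=1)),
    ('m', GaussianNB()),
])

# Cross-validate on the training portion only.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
cv_scores = cross_val_score(demo_model, X_tr, y_tr, scoring='accuracy', cv=cv)

# Fit once on all training data and score the untouched test portion.
demo_model.fit(X_tr, y_tr)
test_accuracy = demo_model.score(X_te, y_te)

print('CV accuracy: %.3f (%.3f)' % (cv_scores.mean(), cv_scores.std()))
print('Test accuracy: %.3f' % test_accuracy)
```

Because the cross-validation now sees only 70% of the samples, its score can differ from one computed on the full X and y.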


https://towardsdatascience.com/why-you-should-always-use-feature-embeddings-with-structured-datasets-7f280b40e716

### Conclusion: All three pipelines performed similarly (mean cross-validated accuracy of 0.775, 0.772, and 0.721), despite parameter tuning.

## 2. Write a function that will indicate if an inputted IPv4 address is accurate or not.
IP addresses are valid if they have 4 values between 0 and 255 (inclusive), punctuated by periods.

Input 1:
2.33.245.5

Output 1:
True

Input 2:
12.345.67.89

Output 2:
False

In [61]:
len(str('2.33.245.5').split('.'))

4

In [60]:
len(str('12.345.67.89').split('.'))

4

In [39]:
#group efforts, especially Madison!

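def accurate(ip):
    # Split into dot-separated fields; a valid address has exactly four.
    val = str(ip).split('.')
    if len(val) != 4:
        return False
    for i in val:
        # Reject empty or non-numeric fields instead of letting int() raise.
        if not (i.isascii() and i.isdigit()) or int(i) > 255:
            return False
    return True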

In [41]:
accurate('2.33.245.5')

True

In [56]:
accurate('12.345.67.89')

False

In [71]:
#my attempt, not that much different

def address_true(address):
    parts = str(address).split('.')
    # Check the field count once, before looking at individual values.
    if len(parts) != 4:
        return False
    for i in parts:
        # Return False on the first bad field; returning True inside the
        # loop would accept the address after checking only its first field.
        if not (i.isascii() and i.isdigit()) or int(i) > 255:
            return False
    return True

In [75]:
address_true('2.33.245.5')

True

In [73]:
address_true('12.345.67.89')

False
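
As a cross-check on the hand-written validators, Python's standard library `ipaddress` module can do the parsing. This is a minimal sketch; `is_valid_ipv4` is a hypothetical name chosen for this example, and the module's handling of edge cases such as leading zeros varies between Python versions.

```python
import ipaddress


def is_valid_ipv4(address):
    """Return True if `address` parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(str(address))
    except ipaddress.AddressValueError:
        return False
    return True


print(is_valid_ipv4('2.33.245.5'))    # True
print(is_valid_ipv4('12.345.67.89'))  # False
print(is_valid_ipv4('1.2.3'))         # False
print(is_valid_ipv4('a.b.c.d'))       # False
```

Passing the input through `str()` first keeps integer inputs from being read as packed 32-bit addresses, so only dotted strings are accepted.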