## 1: Face Recognition, but not evil this time

Using the faces dataset in:

```
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
```

If you use the `faces.target` and `faces.target_names` attributes, you can build a facial recognition algorithm.

Use sklearn **gridsearch** (or an equivalent, like random search) to optimize the model for accuracy. Try both an SVM-based classifier and a logistic-regression-based classifier (with a feature pipeline of your choice) to get the best model. You should reach at least 80% accuracy.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [13]:
faces = fetch_lfw_people(min_faces_per_person=60)
X = faces.data
y = faces.target

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [10]:
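n_comps = 150

# Grid of SVC hyperparameters to search over
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }

pipe = Pipeline([
    # Project the raw pixels onto 150 whitened principal components
    ('pca', PCA(n_components = n_comps, svd_solver ='randomized',
            whiten = True)),
    ('std', StandardScaler()),
    # The grid search is the final step, so only the SVC parameters are tuned
    ('clf', GridSearchCV(SVC(kernel ='rbf', class_weight ='balanced'), param_grid))
])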
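
Placing `GridSearchCV` as the last pipeline step, as in the cell above, tunes only the SVC. One alternative is to wrap the whole pipeline in the search, so that preprocessing settings such as the number of PCA components are tuned as well. The sketch below assumes that design. It uses scikit-learn's bundled `load_digits` data as a small stand-in for the faces (an assumption, chosen so the example runs without a download), and the grid values are illustrative rather than tuned.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data (assumption): 8x8 digit images instead of the LFW faces
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

pipe = Pipeline([
    ('pca', PCA(whiten=True, random_state=42)),
    ('clf', SVC(kernel='rbf', class_weight='balanced')),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {
    'pca__n_components': [20, 40],
    'clf__C': [1, 10],
    'clf__gamma': [0.01, 0.1],
}

# Each candidate refits PCA inside every CV fold, so no fold sees the others' data
search = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

With this layout, `search.best_params_` reports the chosen settings for every step, and `search.best_estimator_` is the refitted pipeline.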


In [16]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('pca',
                 PCA(copy=True, iterated_power='auto', n_components=150,
                     random_state=None, svd_solver='randomized', tol=0.0,
                     whiten=True)),
                ('std',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('clf',
                 GridSearchCV(cv=None, error_score=nan,
                              estimator=SVC(C=1.0, break_ties=False,
                                            cache_size=200,
                                            class_weight='balanced', coef0=0.0,
                                            decis...
                                            degree=3, gamma='scale',
                                            kernel='rbf', max_iter=-1,
                                            probability=False,
                                            random_state=None, shrinking=True,
                                            tol

In [17]:
y_pred = pipe.predict(X_test)
accuracy_score(y_test, y_pred)

0.8516320474777448
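
The task also asks for a logistic regression classifier, which the cells above do not cover. A minimal sketch of one way to set that up follows. It again uses `load_digits` as a stand-in dataset (an assumption, so that the example runs offline); the step names, grid values and `max_iter=1000` are illustrative choices, not results from the faces data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): swap in faces.data / faces.target for the real task
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

logreg_pipe = Pipeline([
    ('std', StandardScaler()),
    ('pca', PCA(random_state=42)),
    # A generous iteration cap helps the default lbfgs solver converge
    ('clf', LogisticRegression(max_iter=1000)),
])

logreg_grid = {
    'pca__n_components': [20, 40],
    'clf__C': [0.1, 1.0],
}

logreg_search = GridSearchCV(logreg_pipe, logreg_grid, cv=3, scoring='accuracy')
logreg_search.fit(X_train, y_train)

logreg_accuracy = logreg_search.score(X_test, y_test)
print(logreg_search.best_params_, logreg_accuracy)
```

Comparing this score with the SVM's on the same train/test split gives a like-for-like choice between the two model families.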

## 2: Bag of Words, Bag of Popcorn

By this point, you are ready for the [Bag of Words, Bag of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) competition. 

Use NLP feature pre-processing (using SKLearn, Gensim, spaCy or Hugging Face) to build the best classifier you can. Use a feature pipeline and gridsearch for your final model.

A successful project should get 90% or more on a **holdout** dataset you kept for yourself.

In [131]:
df = pd.read_csv('/Users/kalebmckenzie/Documents/GitHub/5-1-predictive-modelling/labeledTrainData.tsv', sep='\t')
df = df.drop(['id'],axis=1)

In [133]:
# Data and text cleaning
import re
from bs4 import BeautifulSoup

def remove_html(text):
    # Strip HTML tags; pad with spaces so words at the edges stay delimited
    bs = BeautifulSoup(text, "html.parser")
    return ' ' + bs.get_text() + ' '

def keep_only_letters(text):
    # Drop everything except ASCII letters and whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text)

def convert_to_lowercase(text):
    return text.lower()

def clean_reviews(text):
    # Apply the three cleaning steps in order
    text = remove_html(text)
    text = keep_only_letters(text)
    text = convert_to_lowercase(text)
    return text

# Clean every review once
df['review'] = df['review'].apply(clean_reviews)
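
To see what the cleaning steps do to a single review, here is a small standalone sketch. It is not the cell above: it swaps BeautifulSoup for a simple regular expression so that it runs with the standard library only (an assumption), and `clean_review_sketch` and the sample text are made up for illustration.

```python
import re

def clean_review_sketch(text):
    # Stand-in for BeautifulSoup: replace anything that looks like a tag with a space
    text = re.sub(r'<[^>]+>', ' ', text)
    # Keep only ASCII letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

sample = "<br />This movie was GREAT!!! 10/10, would watch again."
tokens = clean_review_sketch(sample).split()
print(tokens)
# ['this', 'movie', 'was', 'great', 'would', 'watch', 'again']
```

One difference worth knowing: this sketch puts a space where each tag was, whereas `get_text()` with its default empty separator joins the text on either side of a tag directly, so words separated only by `<br />` can end up fused.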

In [134]:
import nltk

# The stop word list must be downloaded once: nltk.download('stopwords')
english_stop_words = nltk.corpus.stopwords.words('english')

In [135]:
def remove_stop_words(text):
    for stopword in english_stop_words:
        stopword = ' ' + stopword + ' '
        text = text.replace(stopword, ' ')
    return text
 
df['review'] = df['review'].apply(remove_stop_words)
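
The space-padded `str.replace` approach has one edge case: when the same stop word appears twice in a row, the two matches share a space, so only the first is removed. The sketch below contrasts it with filtering tokens against a set. The stop word list here is a tiny hand-picked subset used for illustration (an assumption, not the NLTK list), and both function names are hypothetical.

```python
# Tiny illustrative subset, not the NLTK stop word list
STOP_WORDS = ['the', 'is', 'a', 'of']

def remove_stop_words_replace(text, stop_words):
    # Same idea as the cell above: replace ' word ' with a single space
    for stopword in stop_words:
        text = text.replace(' ' + stopword + ' ', ' ')
    return text

def remove_stop_words_tokens(text, stop_words):
    # Split on whitespace and drop every token found in the stop word set
    stop_set = set(stop_words)
    return ' '.join(tok for tok in text.split() if tok not in stop_set)

sample = ' the the plot is is thin '
replace_result = remove_stop_words_replace(sample, STOP_WORDS)
token_result = remove_stop_words_tokens(sample, STOP_WORDS)

print(repr(replace_result))  # ' the plot is thin '
print(repr(token_result))    # 'plot thin'
```

The token-based version also does a single pass over each review instead of one `replace` per stop word, which can matter on 25,000 reviews.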

In [136]:
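from nltk.stem import PorterStemmer

# Build the stemmer once rather than on every call
stemmer = PorterStemmer()

def text_stemming(text):
    # Stem each whitespace-separated token and rejoin with single spaces
    return ' '.join(stemmer.stem(token) for token in text.split())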
 
df['review'] = df['review'].apply(text_stemming)

In [140]:
# Split the data into a training set (first 20,000 reviews) and a holdout set (last 5,000)
train_df = df[:20000]
test_df = df[20000:]

In [146]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

# Raw term counts over unigrams and bigrams
vectorizer = CountVectorizer(binary=False, ngram_range=(1, 2))

# Fit the vocabulary on the training reviews only, then reuse it for the holdout set
tf_features_train = vectorizer.fit_transform(train_df['review'])

tf_features_test = vectorizer.transform(test_df['review'])

print (tf_features_train.shape, tf_features_test.shape)

(20000, 1447824) (5000, 1447824)


In [147]:
train_labels = [1 if sentiment == 1 else 0 for sentiment in train_df['sentiment']]
test_labels = [1 if sentiment == 1 else 0 for sentiment in test_df['sentiment']]
print (len(train_labels), len(test_labels))

20000 5000


In [171]:
from sklearn.linear_model import LogisticRegression
pipes = Pipeline([
    ('reg', LogisticRegression())
])

In [172]:
pipes.fit(tf_features_train, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('reg',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [173]:
predictions = pipes.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
print(accuracy_score(test_labels, predictions))

              precision    recall  f1-score   support

           0       0.89      0.88      0.88      2472
           1       0.88      0.89      0.89      2528

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000

[[2176  296]
 [ 274 2254]]
0.886


In [174]:
print(accuracy_score(test_labels, predictions))

0.886
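
The brief asks for a feature pipeline and a grid search for the final model, and the convergence warning above suggests raising `max_iter`. One way these could fit together is sketched below: the vectorizer and the classifier sit in a single `Pipeline`, and `GridSearchCV` tunes both. The toy reviews, step names and grid values are assumptions made for illustration; on the real data, the cleaned `review` column and `sentiment` labels would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (assumption): stands in for the cleaned training reviews
reviews = [
    'great film loved it', 'wonderful acting great story', 'loved the cast',
    'brilliant and moving', 'great fun loved every minute', 'wonderful film',
    'terrible film hated it', 'awful acting boring story', 'hated the cast',
    'dull and boring', 'terrible waste hated every minute', 'awful film',
]
labels = [1] * 6 + [0] * 6

text_pipe = Pipeline([
    ('vec', CountVectorizer()),
    # A higher iteration cap than the default 100 gives lbfgs room to converge
    ('reg', LogisticRegression(max_iter=1000)),
])

text_grid = {
    'vec__ngram_range': [(1, 1), (1, 2)],
    'reg__C': [0.1, 1.0, 10.0],
}

# The vectorizer is refit inside each fold, so validation text never leaks into the vocabulary
text_search = GridSearchCV(text_pipe, text_grid, cv=3, scoring='accuracy')
text_search.fit(reviews, labels)

print(text_search.best_params_)
prediction = text_search.predict(['loved this great film'])
print(prediction)
```

Because the pipeline takes raw strings, the fitted `text_search` can score the holdout reviews directly with `text_search.score(holdout_reviews, holdout_labels)`, where both names are placeholders for the held-out data.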
