# Lambda School Data Science Unit 4 Sprint Challenge 4

## RNNs, CNNs, AutoML, and more...

In this sprint challenge, you'll explore some of the cutting edge of Data Science.

*Caution* - these approaches can be pretty heavy computationally. All problems were designed so that you should be able to achieve results within at most 5-10 minutes of runtime on Colab or a comparable environment. If something is running longer, double-check your approach!

## Part 1 - RNNs

Use an RNN to fit a simple classification model on tweets to distinguish between tweets from Austen Allred and tweets from Weird Al Yankovic.

Following is code to scrape the needed data (no API auth needed, uses [twitterscraper](https://github.com/taspinar/twitterscraper)):

In [1]:
!pip install twitterscraper



In [2]:
from twitterscraper import query_tweets

austen_tweets = query_tweets('from:austen', 1000)
len(austen_tweets)

INFO: queries: ['from:austen since:2006-03-21 until:2006-11-14', 'from:austen since:2006-11-14 until:2007-07-11', 'from:austen since:2007-07-11 until:2008-03-05', 'from:austen since:2008-03-05 until:2008-10-30', 'from:austen since:2008-10-30 until:2009-06-25', 'from:austen since:2009-06-25 until:2010-02-19', 'from:austen since:2010-02-19 until:2010-10-15', 'from:austen since:2010-10-15 until:2011-06-11', 'from:austen since:2011-06-11 until:2012-02-04', 'from:austen since:2012-02-04 until:2012-09-30', 'from:austen since:2012-09-30 until:2013-05-26', 'from:austen since:2013-05-26 until:2014-01-20', 'from:austen since:2014-01-20 until:2014-09-15', 'from:austen since:2014-09-15 until:2015-05-12', 'from:austen since:2015-05-12 until:2016-01-05', 'from:austen since:2016-01-05 until:2016-08-31', 'from:austen since:2016-08-31 until:2017-04-26', 'from:austen since:2017-04-26 until:2017-12-21', 'from:austen since:2017-12-21 until:2018-08-16', 'from:austen since:2018-08-16 until:2019-04-12']
INFO

181

In [3]:
austen_tweets[0].text

'I love love love working with great people.pic.twitter.com/fCKOm6Vl'

In [4]:
al_tweets = query_tweets('from:AlYankovic', 1000)
len(al_tweets)

INFO: queries: ['from:AlYankovic since:2006-03-21 until:2006-11-14', 'from:AlYankovic since:2006-11-14 until:2007-07-11', 'from:AlYankovic since:2007-07-11 until:2008-03-05', 'from:AlYankovic since:2008-03-05 until:2008-10-30', 'from:AlYankovic since:2008-10-30 until:2009-06-25', 'from:AlYankovic since:2009-06-25 until:2010-02-19', 'from:AlYankovic since:2010-02-19 until:2010-10-15', 'from:AlYankovic since:2010-10-15 until:2011-06-11', 'from:AlYankovic since:2011-06-11 until:2012-02-04', 'from:AlYankovic since:2012-02-04 until:2012-09-30', 'from:AlYankovic since:2012-09-30 until:2013-05-26', 'from:AlYankovic since:2013-05-26 until:2014-01-20', 'from:AlYankovic since:2014-01-20 until:2014-09-15', 'from:AlYankovic since:2014-09-15 until:2015-05-12', 'from:AlYankovic since:2015-05-12 until:2016-01-05', 'from:AlYankovic since:2016-01-05 until:2016-08-31', 'from:AlYankovic since:2016-08-31 until:2017-04-26', 'from:AlYankovic since:2017-04-26 until:2017-12-21', 'from:AlYankovic since:2017-12

960

In [5]:
al_tweets[0].text

"Hey @suzanneyankovic, where'd I leave my shoes?"

In [6]:
len(austen_tweets + al_tweets)

1141

Your tasks:

- Encode the characters to a sequence of integers for the model
- Get the data into the appropriate shape/format, including labels and a train/test split
- Use Keras to fit a predictive model, classifying tweets as being from Austen versus Weird Al
- Report your overall score and accuracy

For reference, the [Keras IMDB sentiment classification example](https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py) will be useful, as well as the RNN code we used in class.

*Note* - focus on getting a running model, not on maxing accuracy with extreme data size or epoch numbers. Only revisit and push accuracy if you get everything else done!
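
The first two tasks (character encoding and shaping) can be sketched in plain Python as below. This is a minimal illustration, not the course solution: the helper names `build_vocab`, `encode` and `pad` are hypothetical, id 0 is reserved for padding by assumption, and the padding mimics the default "pre" behaviour of Keras' `pad_sequences`.

```python
def build_vocab(tweets):
    """Map each distinct character to an integer id, reserving 0 for padding."""
    chars = sorted(set("".join(tweets)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(tweet, char_to_int):
    """Turn one tweet into a list of integer ids, skipping unknown characters."""
    return [char_to_int[ch] for ch in tweet if ch in char_to_int]


def pad(seq, maxlen):
    """Keep the last `maxlen` ids and left-pad shorter sequences with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


# The two sample tweets shown in the scraped output above
tweets = [
    "I love love love working with great people.pic.twitter.com/fCKOm6Vl",
    "Hey @suzanneyankovic, where'd I leave my shoes?",
]
char_to_int = build_vocab(tweets)
encoded = [pad(encode(t, char_to_int), maxlen=80) for t in tweets]
print(len(char_to_int), [len(seq) for seq in encoded])
```

Each tweet becomes one fixed-length row, so the model sees one sample per tweet rather than one long stream of ids.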

In [7]:
# Imports
import numpy as np
# from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from random import sample
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [8]:
# Preprocess data: encoding, shaping, train/test split
# encoding
austen_text = ''
al_text = ''


def process_tweets(text, tweets):
    for tweet in tweets:  # austen_tweets, al_tweets
        try:
            text += '\n\n' + tweet.text  # austen_text, al_text
        except:
            print('Failed: ' + tweet.text)

    text = text.split('\n\n')
    return text


def encode_tweets(text):
    # `text` arrives as a list of tweet strings, so each "character" below is a whole tweet
    chars = list(set(text))  # remove duplicate items and convert to a list

    num_chars = len(chars)  # the number of unique items
    txt_data_size = len(text)
    print("unique characters : ", num_chars)
    print("txt_data_size : ", txt_data_size)
    # build lookup tables between items and integer ids (integer encoding, not one-hot)
    char_to_int = dict((c, i) for i, c in enumerate(chars))  # "enumerate" returns index and value; convert to a dictionary
    int_to_char = dict((i, c) for i, c in enumerate(chars))
    # print(char_to_int)
    # print("----------------------------------------------------")
    # print(int_to_char)
    # print("----------------------------------------------------")
    # integer encode input data
    integer_encoded = [char_to_int[i] for i in text]  # list of integer ids, one per item in `text`
    print(integer_encoded)
    print("----------------------------------------------------")
    print("data length : ", len(integer_encoded))
    return integer_encoded

In [9]:
austen_text_procd = encode_tweets(process_tweets(austen_text, austen_tweets))

unique characters :  224
txt_data_size :  227
[0, 185, 151, 13, 124, 30, 63, 86, 108, 69, 82, 61, 163, 65, 32, 27, 61, 159, 70, 216, 20, 84, 1, 18, 167, 186, 62, 23, 103, 131, 132, 40, 39, 178, 177, 64, 137, 127, 36, 49, 198, 194, 160, 207, 140, 72, 52, 126, 125, 31, 19, 21, 138, 206, 99, 181, 142, 109, 35, 11, 111, 50, 107, 75, 195, 44, 29, 10, 143, 4, 67, 105, 211, 200, 183, 8, 121, 98, 154, 90, 6, 60, 132, 205, 156, 93, 74, 7, 155, 78, 91, 115, 148, 203, 187, 94, 144, 199, 15, 76, 172, 87, 175, 96, 68, 214, 223, 141, 85, 123, 189, 150, 149, 73, 147, 2, 33, 92, 24, 208, 174, 79, 128, 66, 201, 71, 89, 165, 153, 139, 217, 114, 161, 97, 3, 80, 162, 72, 81, 168, 38, 190, 17, 9, 119, 182, 55, 188, 170, 209, 213, 221, 222, 45, 77, 204, 171, 210, 22, 53, 51, 129, 164, 34, 176, 101, 192, 116, 202, 83, 113, 37, 212, 102, 54, 122, 130, 152, 59, 5, 117, 46, 41, 57, 56, 100, 104, 118, 179, 106, 12, 193, 136, 215, 169, 58, 166, 184, 47, 95, 145, 220, 173, 157, 110, 191, 158, 218, 25, 43, 26, 196,

In [10]:
al_text_procd = encode_tweets(process_tweets(al_text, al_tweets))

unique characters :  965
txt_data_size :  965
[0, 561, 527, 870, 875, 253, 287, 453, 104, 948, 788, 210, 221, 143, 419, 738, 69, 66, 350, 594, 9, 299, 951, 676, 722, 954, 97, 256, 82, 912, 175, 399, 665, 715, 697, 853, 212, 958, 891, 501, 243, 699, 356, 347, 482, 306, 296, 719, 375, 447, 774, 409, 334, 421, 59, 251, 765, 959, 789, 442, 631, 150, 925, 752, 55, 369, 859, 928, 274, 578, 110, 135, 844, 292, 111, 475, 286, 124, 354, 321, 247, 690, 489, 342, 600, 916, 423, 93, 691, 12, 663, 208, 263, 480, 327, 105, 898, 726, 748, 492, 268, 205, 498, 588, 41, 770, 54, 776, 679, 60, 240, 389, 881, 433, 359, 825, 365, 4, 427, 74, 619, 851, 170, 509, 705, 599, 326, 56, 605, 345, 328, 434, 805, 655, 629, 400, 504, 574, 817, 922, 876, 23, 44, 19, 165, 200, 759, 678, 29, 541, 325, 5, 961, 346, 962, 829, 64, 841, 24, 408, 68, 270, 545, 317, 382, 903, 861, 183, 219, 687, 653, 209, 737, 436, 546, 348, 157, 535, 257, 706, 203, 311, 945, 768, 874, 573, 675, 531, 777, 565, 390, 792, 87, 106, 85, 543, 162

In [11]:
x_train, x_test, y_train, y_test = train_test_split(austen_text_procd, sample(al_text_procd, 227),
                                                    test_size=0.25, random_state=7)
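
For a classifier, the split needs one feature row per tweet and a matching 0/1 author label. A rough sketch of that shape follows, using only the standard library: `labeled_split` is a hypothetical helper, and the choice of 0 for Austen and 1 for Al is an assumption. With scikit-learn, passing the combined sequences and a label list to `train_test_split` gives the same kind of result.

```python
import random


def labeled_split(austen_seqs, al_seqs, test_size=0.25, seed=7):
    """Label Austen as 0 and Al as 1, shuffle, then split into train and test."""
    pairs = [(seq, 0) for seq in austen_seqs] + [(seq, 1) for seq in al_seqs]
    random.Random(seed).shuffle(pairs)
    n_test = int(round(len(pairs) * test_size))
    test, train = pairs[:n_test], pairs[n_test:]
    x_train, y_train = [p[0] for p in train], [p[1] for p in train]
    x_test, y_test = [p[0] for p in test], [p[1] for p in test]
    return x_train, x_test, y_train, y_test


# Toy stand-ins for encoded tweets (hypothetical values)
austen = [[1, 2, 3]] * 4
al = [[4, 5, 6]] * 4
x_train, x_test, y_train, y_test = labeled_split(austen, al)
print(len(x_train), len(x_test), sorted(set(y_train + y_test)))
```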

In [12]:
# Fit keras model and report score, accuracy
'''
Proviso: The dataset is actually too small for LSTM to be of any advantage
compared to simpler, much faster methods such as TF-IDF + LogReg.
**Notes**
- Choice of batch size is important; choice of loss and optimizer is critical, etc.
Some configurations won't converge.
- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.
'''

max_features = 2000
# cut texts after this number of words (among top max_features most common words)
maxlen = 80
batch_size = 32

# print('Loading data...')

# print(len(x_train), 'train sequences')
# print(len(x_test), 'test sequences')

# print('Pad sequences (samples x time)')
# x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
# x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
# print('x_train shape:', x_train.shape)
# print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model...
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Train...
Instructions for updating:
Use tf.cast instead.
Train on 170 samples, validate on 57 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: -665.2370369894463
Test accuracy: 0.0


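Conclusion - the RNN runs, but the results above (test accuracy 0.0 and a negative loss) show it has not learned a valid classifier, so it does not yet improve on a naive "It's Al!" model. A negative binary cross-entropy indicates the targets were not 0/1 labels: the split passes encoded Al tweets as `y` rather than an author label for each tweet. Once the labels are fixed, more playing with parameters, and just getting more data (particularly Austen tweets), would help to *really* improve the model. Also - an RNN may well not be the best approach here, but it is at least a valid one.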
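
For context, the naive "It's Al!" baseline can be computed from the tweet counts scraped earlier (181 from Austen, 960 from Al). This is a small arithmetic sketch over the full scraped set; a model has to beat this number to be useful.

```python
# Tweet counts reported by the scraping cells above
n_austen, n_al = 181, 960

# Always guessing the larger class is right this fraction of the time
baseline_acc = max(n_austen, n_al) / (n_austen + n_al)
print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```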

## Part 2 - CNNs

Time to play "find the frog!" Use Keras and ResNet50 to detect which of the following images contain frogs:

In [13]:
!pip install google_images_download



In [14]:
from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
arguments = {"keywords": "animal pond", "limit": 5, "print_urls": True}
absolute_image_paths = response.download(arguments)


Item no.: 1 --> Item name = animal pond
Evaluating...
Starting Download...
Image URL: https://www.enchantedlearning.com/pgifs/Pondanimals.GIF
Completed Image ====> 1. pondanimals.gif
Image URL: https://i.ytimg.com/vi/NCbu0TND9vE/hqdefault.jpg
Completed Image ====> 2. hqdefault.jpg
Image URL: https://vetstreet-brightspot.s3.amazonaws.com/8d/ac/377fecad46d8820697c26efacc32/koi-pond-thinkstock-153560141-335sm61313.jpg
Completed Image ====> 3. koi-pond-thinkstock-153560141-335sm61313.jpg
Image URL: https://pklifescience.com/staticfiles/articles/images/PKLS4116_inline.png
Completed Image ====> 4. pkls4116_inline.png
Image URL: https://pixnio.com/free-images/fauna-animals/reptiles-and-amphibians/alligators-and-crocodiles-pictures/alligator-animal-on-pond.jpg
Completed Image ====> 5. alligator-animal-on-pond.jpg

Errors: 0



At the time of writing at least a few do, but since the Internet changes, it is possible your 5 won't. You can easily verify this yourself, and (once you have working code) increase the number of images you pull to be more sure of getting a frog. Your goal is to validly run ResNet50 on the input images - don't worry about tuning or improving the model.

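*Hint* - ResNet50 doesn't just return "frog". The three labels it has for frogs are: `bullfrog, tree frog, tailed frog` (Keras' `decode_predictions` writes the two-word labels with underscores, as `tree_frog` and `tailed_frog`).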

*Stretch goal* - also check for fish.
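
The label check from the hint can be sketched independently of the model. The snippet below assumes predictions shaped like one row of `decode_predictions(...)`, i.e. `(class id, label, score)` tuples. The helper name `label_score`, the class ids and the scores are all made up for illustration. Passing a different label set (for example fish labels) would cover the stretch goal in the same way.

```python
FROG_LABELS = {"bullfrog", "tree frog", "tailed frog"}


def label_score(results, labels=FROG_LABELS):
    """Return the highest score among predictions whose label is in `labels`, else 0.0."""
    scores = [
        score
        for _class_id, name, score in results
        # normalise underscores so "tree_frog" matches "tree frog"
        if name.replace("_", " ") in labels
    ]
    return max(scores, default=0.0)


# Hypothetical predictions, not real model output
sample = [("id_a", "tree_frog", 0.62), ("id_b", "bullfrog", 0.21), ("id_c", "lakeside", 0.05)]
print(label_score(sample))
print(label_score([("id_c", "lakeside", 0.9)]))
```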

In [15]:
from IPython.display import Image
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions


# Load the pretrained network once instead of on every call
resnet = ResNet50(weights='imagenet')

# decode_predictions writes multi-word labels with underscores
FROG_LABELS = ('bullfrog', 'tree_frog', 'tailed_frog')


def process_img_path(img_path):
    return image.load_img(img_path, target_size=(224, 224))


def img_contains_frog(img):
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = resnet.predict(x)
    results = decode_predictions(features, top=3)[0]
    print(results)
    for entry in results:
        # entry is (class id, label, score)
        if entry[1] in FROG_LABELS:
            return entry[2]
    return 0.0

In [16]:
absolute_image_paths

{'animal pond': ['C:\\git\\LSDS\\DS-Unit-4-Sprint-4-Deep-Learning\\downloads\\animal pond\\1. pondanimals.gif',
  'C:\\git\\LSDS\\DS-Unit-4-Sprint-4-Deep-Learning\\downloads\\animal pond\\2. hqdefault.jpg',
  'C:\\git\\LSDS\\DS-Unit-4-Sprint-4-Deep-Learning\\downloads\\animal pond\\3. koi-pond-thinkstock-153560141-335sm61313.jpg',
  'C:\\git\\LSDS\\DS-Unit-4-Sprint-4-Deep-Learning\\downloads\\animal pond\\4. pkls4116_inline.png',
  'C:\\git\\LSDS\\DS-Unit-4-Sprint-4-Deep-Learning\\downloads\\animal pond\\5. alligator-animal-on-pond.jpg']}

In [17]:
procd_images = []
for path in absolute_image_paths['animal pond']:
    # bind to `img`, not `image`, so the keras `image` module is not shadowed
    img = process_img_path(path)
    procd_images.append(path)  # keep the path for the cells below
    if img_contains_frog(img):
        print(path, 'has a frog in it.')

In [None]:
import os
os.chdir(os.path.join('downloads', 'animal pond'))  # portable across Windows and Unix

In [None]:
os.listdir()

In [None]:
Image(filename='1. pondanimals.gif', width=600)

In [None]:
for img_path in procd_images:
    print(img_path, img_contains_frog(process_img_path(img_path)))

## Part 3 - AutoML

Use [TPOT](https://github.com/EpistasisLab/tpot) to fit a predictive model for the King County housing data, with `price` as the target output variable.

In [18]:
!pip install tpot



In [21]:
!pip install wget
!wget https://raw.githubusercontent.com/ryanleeallred/datasets/master/kc_house_data.csv



'wget' is not recognized as an internal or external command,
operable program or batch file.
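
The error above suggests the shell has no `wget` command (the pip package of that name installs a Python module, not a shell tool). One portable alternative is to download with the standard library. The `download` helper below is a hypothetical sketch; the URL in the usage comment is the one from the cell above.

```python
import urllib.request
from pathlib import Path


def download(url, dest):
    """Fetch `url` to the local path `dest` and return the number of bytes written."""
    path, _headers = urllib.request.urlretrieve(url, dest)
    return Path(path).stat().st_size


# Usage (needs network access):
# download("https://raw.githubusercontent.com/ryanleeallred/datasets/master/kc_house_data.csv",
#          "kc_house_data.csv")
```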


In [None]:
!head kc_house_data.csv

As with previous questions, your goal is to run TPOT successfully and report the error at the end. Also, in the interest of time, feel free to choose small `generations=1` and `population_size=10` parameters so your pipeline runs efficiently and you are able to iterate and test.

*Hint* - you'll have to drop and/or type coerce at least a few variables to get things working. It's fine to err on the side of dropping to get things running, as long as you still get a valid model with reasonable predictive power.
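
As an alternative to dropping a column, a string column such as `date` can be coerced into numeric features. The sketch below assumes the dates look like `20141013T000000`; that format is an assumption worth checking against the actual file, and `date_features` is a hypothetical helper.

```python
from datetime import datetime


def date_features(raw):
    """Split a sale date string into numeric (year, month) features."""
    parsed = datetime.strptime(raw, "%Y%m%dT%H%M%S")
    return parsed.year, parsed.month


# Example value in the assumed format
print(date_features("20141013T000000"))
```

Applied to each row, this would yield two integer columns that TPOT can use alongside the other numeric features.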

In [22]:
# Imports
import pandas as pd
from tpot import TPOTRegressor

In [None]:
kc_house_prices = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/kc_house_data.csv')
df = kc_house_prices.copy()

In [25]:
# shape is (21613, 21)
kc_house_prices.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [26]:
kc_house_prices = kc_house_prices.drop(['id', 'date', 'grade', 'lat', 'long'], axis=1)

In [27]:
kc_house_prices.isna().sum()

price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [32]:
kc_house_prices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 16 columns):
price            21613 non-null float64
bedrooms         21613 non-null int64
bathrooms        21613 non-null float64
sqft_living      21613 non-null int64
sqft_lot         21613 non-null int64
floors           21613 non-null float64
waterfront       21613 non-null int64
view             21613 non-null int64
condition        21613 non-null int64
sqft_above       21613 non-null int64
sqft_basement    21613 non-null int64
yr_built         21613 non-null int64
yr_renovated     21613 non-null int64
zipcode          21613 non-null int64
sqft_living15    21613 non-null int64
sqft_lot15       21613 non-null int64
dtypes: float64(3), int64(13)
memory usage: 2.6 MB


In [33]:
X = kc_house_prices.drop('price', axis=1)
y = kc_house_prices.price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# Fit model

tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))



Generation 1 - Current best internal CV score: -33709012362.469643
Generation 2 - Current best internal CV score: -31000854418.422516
Generation 3 - Current best internal CV score: -27986122886.222393
Generation 4 - Current best internal CV score: -27986122886.222393
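
The internal CV scores are hard to read as printed. Assuming `TPOTRegressor` was left on its default scoring (negative mean squared error), the best score in the log can be turned into a root mean squared error in dollars, as in this small sketch:

```python
import math

# Best internal CV score from the log above (assumed to be negative MSE)
best_cv_score = -27986122886.222393

# Flip the sign to get MSE, then take the square root for an error in price units
rmse = math.sqrt(-best_cv_score)
print(f"Approximate CV RMSE: ${rmse:,.0f}")
```

Under that assumption the typical cross-validated error is roughly $167,000 per house, which gives a more interpretable figure to report than the raw score.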


## Part 4 - More...

Answer the following questions, with a target audience of a fellow Data Scientist.  
A few sentences per answer is fine - only elaborate if time allows.  
- What do you consider your strongest area, as a Data Scientist?  
A:  My strongest area is understanding the elementary concepts of machine learning and neural networks. I enjoy building classification models that draw on bagged and/or boosted decision tree ensembles. While I need to revisit my calculus studies, I believe I have a good
initial grasp of backpropagation, and I love to consider activation flow.  
 
 
- What area of Data Science would you most like to learn more about, and why?  
A:  Logistic regression, because I see so much of natural phenomena (including human
societies) waxing and waning on a sigmoid function.  
  
  
- Where do you think Data Science will be in 5 years?  
A:  I listened closely when the LSDS Program Director told the DS01 cohort about one top Silicon Valley rideshare company's
take on predictive modeling: that it would be solved in the next 5 years or so. What I wonder about, more
broadly, is if/when humans will enable machines to mechanistically evolve heuristics.


Thank you for your hard work, and congratulations! You've learned a lot, and should proudly call yourself a Data Scientist.